val_bpb: 1.1318
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.7 MB
Training Techniques
Quantization
mixed int6/int8
bits: 6 (8 for embeddings)
scope: MLP and attention weights in int6; embeddings in int8
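The exact quantization scheme is not spelled out above; a minimal sketch, assuming symmetric per-tensor linear quantization with a single float scale, looks like this:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization (sketch): map floats to signed ints
    in [-2**(bits-1)+1, 2**(bits-1)-1], keeping one float scale per tensor
    for dequantization. The run's actual scheme is an assumption here."""
    qmax = 2 ** (bits - 1) - 1               # 31 for int6, 127 for int8
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# MLP/attention weights use 6 bits; embeddings use 8 bits.
w = np.array([0.5, -0.25, 0.1, -0.05], dtype=np.float32)
q6, s6 = quantize_symmetric(w, bits=6)
w_hat = dequantize(q6, s6)
```

Round-trip error is bounded by half the scale, which is why weight decay (keeping weight magnitudes small and uniform) makes tensors quantization-friendly.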
Architecture
MLP3x
Uses a 3x MLP with hidden size 1536 and relu² activation.
parameters: {"hidden_size":1536}
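A minimal forward pass for the 3x MLP with squared-ReLU activation, with toy dimensions (the bias-free layout is an assumption; only hidden_size=1536 is from the config):

```python
import numpy as np

def relu2(x):
    # relu² activation: squared ReLU, as used in the MLP block.
    return np.maximum(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    """3x MLP sketch: project d_model -> 3*d_model hidden (1536 in this run),
    apply relu², project back. Bias-free layout is an assumption."""
    return relu2(x @ w_in) @ w_out

rng = np.random.default_rng(0)
d_model, hidden = 512, 1536   # hidden = 3 * d_model, matching hidden_size 1536
x = rng.standard_normal((4, d_model)).astype(np.float32)
w_in = (rng.standard_normal((d_model, hidden)) / np.sqrt(d_model)).astype(np.float32)
w_out = (rng.standard_normal((hidden, d_model)) / np.sqrt(hidden)).astype(np.float32)
y = mlp3x_forward(x, w_in, w_out)
```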
SmearGate
Learned token-blending gate added to the residual stream.
parameters: {"parameters":512}
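One plausible reading of a token-blending ("smear") gate, sketched below: each token is mixed with its predecessor through a learned per-channel gate, which would account for the 512 parameters if the model has 512 channels. Both the blending rule and the parameter interpretation are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, gate_logits):
    """SmearGate sketch (assumption): blend each token with its predecessor
    via a learned per-channel gate g in (0, 1), added in the residual stream."""
    g = sigmoid(gate_logits)        # (d_model,) learned gate
    prev = np.roll(x, 1, axis=0)    # shift tokens right by one position
    prev[0] = 0.0                   # first token has no predecessor
    return x + g * prev

T, d = 5, 8
rng = np.random.default_rng(1)
x = rng.standard_normal((T, d))
y = smear_gate(x, gate_logits=np.zeros(d))   # zero logits -> g = 0.5
```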
BigramHash
Bigram hash embedding that injects token-pair features into the residual stream.
parameters: {"bigram_vocab_size":2048}
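A sketch of the bigram hash lookup: each (previous, current) token pair is hashed into a 2048-row table and the row is added to that position's features. The multiplicative hash constant is illustrative, not the run's actual choice.

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    """BigramHash sketch: hash each (prev, cur) token pair into a fixed table
    (2048 rows in this run) and return per-position feature rows for the
    residual stream. Hash function is an illustrative assumption."""
    tokens = np.asarray(tokens)
    prev = np.concatenate([[0], tokens[:-1]])            # pad first position
    idx = (prev * 1000003 + tokens) % table.shape[0]     # simple pair hash
    return table[idx]                                    # (T, d) features

rng = np.random.default_rng(2)
table = rng.standard_normal((2048, 16))   # bigram_vocab_size=2048, toy d=16
feats = bigram_hash_embed([7, 7, 7], table)
```

Identical bigrams hash to the same row, so repeated pairs share features; hashing trades collision risk for a fixed, small table.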
RoPE
Positional encoding uses NTK-aware RoPE.
parameters: null
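NTK-aware RoPE rescales the rotary base rather than interpolating positions, so high-frequency channels are nearly unchanged while low frequencies stretch. A sketch using the common NTK-aware base formula (the run's exact scaling factor is not given):

```python
import numpy as np

def ntk_rope_freqs(head_dim, base=10000.0, scale=1.0):
    """NTK-aware RoPE sketch: scale the rotary base by the standard
    NTK-aware exponent, then build the per-pair inverse frequencies."""
    base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, pos, freqs):
    """Rotate each (even, odd) channel pair of x by angle pos * freq."""
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

freqs = ntk_rope_freqs(8)
```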
FlashAttention 3
Uses direct flash_attn_func calls for attention.
parameters: null
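`flash_attn_func` (from the flash-attn package) is a fused kernel; for reference, the computation it performs is standard causal attention, sketched here in numpy without the memory-efficient tiling:

```python
import numpy as np

def causal_attention_reference(q, k, v):
    """Reference for what a fused kernel like flash_attn_func computes:
    softmax(q k^T / sqrt(d)) v under a causal mask. FlashAttention 3 yields
    the same result without materializing the (T, T) score matrix."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # mask future positions
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(4)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
out = causal_attention_reference(q, k, v)
```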
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3000,"adamw_weight_decay":0.04}
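The config implies a momentum warmup from 0.92 to the final 0.99 over 1500 steps; a sketch assuming the ramp is linear (the actual interpolation shape is an assumption):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Muon momentum warmup sketch: ramp momentum linearly from
    muon_momentum_warmup_start (0.92) to the final value (0.99) over the
    first 1500 steps, then hold it constant."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```

Lower momentum early in training reduces the influence of noisy initial gradients before settling at the run's steady-state value.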
Weight Averaging
SWA
parameters: {"checkpoint_avg_count":8,"warmdown_lr_scale_threshold":0.5,"checkpoint_interval_steps":200}
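The averaging step itself is a uniform mean over checkpoint parameter dicts. Per the config, the run averages ~8 checkpoints saved every 200 steps; reading `warmdown_lr_scale_threshold: 0.5` as "start saving once the LR scale drops below 0.5" is an assumption.

```python
def average_checkpoints(checkpoints):
    """SWA sketch: uniform average of parameter dicts from the last ~8
    checkpoints collected during warmdown."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

avg = average_checkpoints([{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}])
```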
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
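Sliding-window evaluation advances a full-length window by the stride and scores only the not-yet-scored tokens in each window, so nearly every token is conditioned on close to the full 2048-token context. A sketch with toy sizes:

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Sliding-window eval sketch: emit (ctx_start, ctx_end, n_scored) spans.
    Each window scores only the tokens after the previous window's end, so
    every token is scored exactly once with near-full context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_eval_spans(10, window=4, stride=2)   # toy sizes
```

With the run's stride of 64 and window of 2048, each forward pass contributes loss for just 64 tokens, trading compute for evaluation fidelity.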
Initialization
OrthoInit
Orthogonal plus muP-scaled initialization on large matrices.
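A sketch of orthogonal initialization with a muP-style scale, assuming the multiplier is 1/sqrt(fan_in) (the run's exact muP scaling is not given):

```python
import numpy as np

def ortho_mup_init(fan_out, fan_in, rng):
    """OrthoInit sketch: draw a Gaussian matrix, orthogonalize via QR, and
    apply a muP-style 1/sqrt(fan_in) scale. The exact multiplier used in
    the run is an assumption."""
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    q, r = np.linalg.qr(a)                 # q has orthonormal columns
    q = q * np.sign(np.diag(r))            # remove QR sign ambiguity
    w = q if fan_out >= fan_in else q.T
    return w / np.sqrt(fan_in)

rng = np.random.default_rng(3)
w = ortho_mup_init(8, 16, rng)             # wide matrix: orthonormal rows
```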
Sequence Length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
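The config names a 1500-step warmup and a 3000-iteration warmdown; a sketch assuming the common trapezoidal shape (linear warmup, constant plateau, linear decay to zero), with `total_steps` as a placeholder since the run's full step count is not stated:

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """Warmup/warmdown LR schedule sketch (trapezoid shape is an assumption):
    linear warmup, constant plateau, then linear warmdown to zero over the
    final 3000 iterations."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```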
Regularization
weight decay
parameters: {"muon_wd":0.04,"adamw_wd":0.04}
Novel Contributions
- Increased depth to 11 transformer layers to gain capacity while staying under the artifact size limit via int6 quantization.
- Applied weight decay 0.04 to keep weights quantization-friendly and improve int6 compression.
- Used stochastic weight averaging over the last ~8 checkpoints (saved every 200 steps) during warmdown.
- Evaluated with sliding-window stride 64 for near-full context scoring.
- Reduced bigram vocabulary from 4096 to 2048 to save artifact space with minimal BPB impact.
- Kept and combined prior techniques including OrthoInit + muP, 3x MLP, SmearGate, BigramHash, and FlashAttention 3.