PR #198

RECORD · open

11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318)

by jfprincz
val_bpb: 1.1318
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.7 MB

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: MLP and attention int6; embeddings int8
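The mixed int6/int8 scheme above can be sketched as symmetric round-to-nearest quantization; this is a hypothetical helper, not the submission's actual code, and assumes a single per-tensor scale.

```python
# Sketch of symmetric 6-bit quantization (assumed scheme): map floats to
# signed integers in [-32, 31] with one per-tensor scale. The submission's
# actual grouping/scale granularity is not stated in the card.

def quantize_int6(weights):
    """Quantize a list of floats to 6-bit signed ints plus a scale."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int6 codes."""
    return [x * scale for x in q]

w = [0.5, -0.25, 0.1, -0.9]
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

Int8 for embeddings would follow the same pattern with the clamp range widened to [-128, 127].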
Architecture
MLP3x
Uses a 3x MLP with hidden size 1536 and relu² activation.
parameters: {"hidden_size":1536}
SmearGate
Learned token blending gate added to the residual stream.
parameters: {"parameters":512}
BigramHash
Bigram hash embedding that injects token-pair features into the residual stream.
parameters: {"bigram_vocab_size":2048}
RoPE
Attention uses NTK-aware RoPE for positional encoding.
parameters: null
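NTK-aware RoPE rescales the rotary base so that low frequencies stretch more than high ones when extending context. A minimal sketch of the frequency computation, assuming a `head_dim` and a context-extension factor `scale` (both hypothetical names here):

```python
# Sketch of NTK-aware RoPE inverse frequencies. With scale=1.0 this
# reduces to standard RoPE; scale>1 stretches the rotary base.

def ntk_rope_freqs(head_dim, base=10000.0, scale=1.0):
    adjusted = base * scale ** (head_dim / (head_dim - 2))
    return [1.0 / adjusted ** (2 * i / head_dim)
            for i in range(head_dim // 2)]
```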
FlashAttention 3
Uses direct flash_attn_func calls for attention.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3000,"adamw_weight_decay":0.04}
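The `muon_momentum_warmup_*` parameters above imply a momentum ramp from 0.92 to 0.99 over the first 1500 steps; a linear ramp is an assumption (the exact shape is not stated in the card).

```python
# Sketch of the Muon momentum warmup implied by other_params:
# linearly interpolate from 0.92 to 0.99 over 1500 steps, then hold.

def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    frac = min(1.0, step / warmup_steps)
    return start + frac * (end - start)
```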
Weight Averaging
SWA
parameters: {"checkpoint_avg_count":8,"warmdown_lr_scale_threshold":0.5,"checkpoint_interval_steps":200}
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
Initialization
OrthoInit
Orthogonal plus muP-scaled initialization on large matrices.
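A minimal sketch of orthogonal-plus-scaled init: orthonormalize Gaussian rows, then apply a fan-in-dependent scale. The 1/sqrt(fan_in) factor is an assumed stand-in for the muP rule; the submission's exact scaling is not stated in the card.

```python
import random, math

# Sketch of orthogonal init with a muP-style scale: Gram-Schmidt on
# Gaussian rows (requires rows <= cols), then scale by 1/sqrt(fan_in).

def ortho_init(rows, cols, seed=0):
    rng = random.Random(seed)
    m = [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
    basis = []
    for v in m:
        for b in basis:  # subtract projections onto earlier rows
            d = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - d * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    scale = 1.0 / math.sqrt(cols)  # assumed muP-style fan_in scaling
    return [[scale * bi for bi in b] for b in basis]
```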
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
Regularization
weight decay
parameters: {"muon_wd":0.04,"adamw_wd":0.04}

Novel Contributions

  • Increased depth to 11 transformer layers to gain capacity while staying under the artifact limit via int6 quantization.
  • Applied weight decay 0.04 to keep weights quantization-friendly and improve int6 compression.
  • Used stochastic weight averaging over the last 8 checkpoints (saved every 200 steps) during warmdown.
  • Evaluated with sliding-window stride 64 for near-full context scoring.
  • Reduced bigram vocabulary from 4096 to 2048 to save artifact space with minimal BPB impact.
  • Kept and combined prior techniques including OrthoInit + muP, 3x MLP, SmearGate, BigramHash, and FlashAttention 3.