PR #179

open

Record: 11L, int6+zstd, decoupled WD (val_bpb = 1.1472)

by devin-cogView on GitHub
val_bpb
1.1472
Architecture
GPT
Optimizer
Muon
Artifact Size
15,905,331 bytes

Training Techniques

Quantization
int6
bits: 6
scope: MLP and attention weights; embeddings kept in fp16
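A minimal sketch of per-row symmetric int6 quantization as described above (6-bit signed range [-32, 31], one scale per row). The rounding/packing details are assumptions; the PR's actual implementation may differ:

```python
import numpy as np

def quantize_int6_per_row(w):
    """Per-row symmetric int6 quantization: each row gets its own scale."""
    qmax = 31  # 6-bit signed range is [-32, 31]; use 31 for symmetry
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale
```

Per-row scales keep the worst-case rounding error at half a quantization step per row, rather than letting one outlier row inflate the error everywhere.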
Architecture
GQA / KV head count
GPT with grouped-query attention, using fewer key/value heads than query heads
parameters: {"layers":11,"num_heads":8,"num_kv_heads":4}
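The head layout above (8 query heads sharing 4 KV heads) can be sketched as follows; causal masking is omitted for brevity, and the dimensions other than head counts are illustrative:

```python
import numpy as np

# Head layout from the PR: 8 query heads, 4 KV heads -> each KV head
# is shared by a group of 2 query heads, halving KV cache size.
num_heads, num_kv_heads, head_dim, T = 8, 4, 16, 10
group = num_heads // num_kv_heads

rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Broadcast each KV head across its group of query heads
k_rep = np.repeat(k, group, axis=0)  # (num_heads, T, head_dim)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
w = np.exp(scores - scores.max(-1, keepdims=True))
attn = w / w.sum(-1, keepdims=True)
out = attn @ v_rep  # (num_heads, T, head_dim)
```

Only the K/V projections shrink; the query side and the output shape are unchanged, which is what makes GQA cheap in parameters without reducing the number of attention patterns.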
Optimizer
Muon
weight_decay: 0.038
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.03,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
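A sketch of the two Muon-side pieces above: decoupled weight decay (applied directly to the weights, AdamW-style, independent of the update direction) and the linear momentum warmup from 0.92 to 0.99 over 1500 steps. `update` here stands in for Muon's orthogonalized momentum step, which is not reproduced:

```python
def decoupled_wd_step(param, update, lr, weight_decay=0.038):
    """Decoupled weight decay: shrink weights directly, then apply the
    optimizer update. Hypothetical sketch of the PR's technique."""
    return param * (1.0 - lr * weight_decay) - lr * update

def momentum_at(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linear momentum warmup matching the PR's settings."""
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```

Decoupling keeps the shrinkage toward zero proportional to the weight itself rather than folding it into the (orthogonalized) gradient, which is the property the "reduce quantization gap" contribution relies on.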
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
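A sketch of sliding-window evaluation with stride 64: each window scores only its last `stride` tokens, with the rest of the window serving as context, so every token is predicted with near-full context. The window size is an assumption (taken equal to train_length=2048):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Return (context_start, score_start, score_end) triples whose scored
    ranges partition [0, n_tokens). Sketch; window=2048 assumed."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)  # up to window-stride context tokens
        spans.append((ctx_start, pos, end))
        pos = end
    return spans
```

This trades roughly `window / stride` (here 32x) more forward passes for a lower, context-rich loss, which is why it appears as an evaluation technique rather than a training one.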
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup and warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
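The schedule above reads as a trapezoid: linear warmup over 1500 steps, a constant plateau, then linear warmdown over the last 3000 steps. A sketch under that assumption (the exact shape and base_lr=0.025, taken from matrix_lr, are not confirmed by the PR):

```python
def lr_at(step, total_steps, base_lr=0.025,
          warmup_steps=1500, warmdown_steps=3000):
    """Trapezoidal LR schedule: warmup, plateau, warmdown to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps      # linear warmup
    if step >= total_steps - warmdown_steps:
        return base_lr * (total_steps - step) / warmdown_steps  # warmdown
    return base_lr                                      # plateau
```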
Regularization
weight decay
parameters: {"weight_decay":0.038}
Other
other
Aside (non-record) submission: trained only on the validation shard (val-only training)
parameters: {"train_files":"fineweb_val_*.bin"}

Novel Contributions

  • Decoupled weight decay in Muon to reduce the quantization gap
  • 11-layer GPT with GQA to fit under 16MB
  • Int6 per-row quantization with fp16 embeddings
  • Sliding window evaluation with stride 64
  • Higher learning rate / tuned Muon settings for improved convergence
  • Val-only training aside demonstrating the approach
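The 16MB budget in the contributions above can be checked directly against the reported artifact size, assuming the limit means 16 MiB:

```python
artifact_bytes = 15_905_331     # reported artifact size from this PR
budget = 16 * 1024 * 1024       # 16 MiB = 16,777,216 bytes (assumed limit)
headroom = budget - artifact_bytes
assert artifact_bytes < budget  # fits with headroom to spare
```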