PR #86

RECORD (closed)

Update: 11L MLP3x + WD=0.04 + zstd-22 (val_bpb 1.1502)

by aruniyer
val_bpb: 1.1502
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4 MB

Training Techniques

Architecture
MLP3x
Widened MLP expansion to 3x for more capacity per layer.
parameters: {"mlp_mult":3,"hidden":1536}
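A minimal sketch of the per-layer parameter cost of the 3x MLP expansion, assuming a standard two-projection MLP (up-projection then down-projection, biases omitted; the exact block layout is not given in this record):

```python
def mlp_params(hidden, mlp_mult):
    # Two projections: hidden -> mlp_mult*hidden, then back to hidden.
    inner = mlp_mult * hidden
    return hidden * inner + inner * hidden

# With hidden=1536 and mlp_mult=3 as in this record:
per_layer = mlp_params(1536, 3)  # 14,155,776 weights per MLP block
```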
tied embeddings
Uses tied input/output embeddings with FP16 export for the embedding/head.
parameters: {"vocab_size":1024}
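A rough sketch of what tying saves in the exported artifact, assuming the embedding dimension equals the hidden size of 1536 from the MLP entry (not stated explicitly here): tying shares one (vocab, d_model) matrix between input embedding and output head, so one full copy drops out of the FP16 export.

```python
def tied_embedding_savings(vocab_size, d_model, bytes_per_param=2):
    # One shared (vocab, d_model) matrix instead of two separate ones;
    # bytes_per_param=2 matches the FP16 export mentioned in the record.
    return vocab_size * d_model * bytes_per_param

saved = tied_embedding_savings(1024, 1536)  # 3,145,728 bytes (~3 MiB)
```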
KV head count
Uses grouped-query attention with 4 KV heads.
parameters: {"attention_heads":8,"kv_heads":4}
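With 8 query heads and 4 KV heads, grouped-query attention maps each contiguous pair of query heads onto one shared KV head. A minimal sketch of that mapping:

```python
def kv_head_for_query(q_head, n_heads=8, n_kv_heads=4):
    # Each KV head serves a contiguous group of query heads
    # (here 8 // 4 = 2 query heads per KV head).
    group_size = n_heads // n_kv_heads
    return q_head // group_size

mapping = [kv_head_for_query(h) for h in range(8)]  # [0,0,1,1,2,2,3,3]
```

Halving the KV heads halves the KV projection weights and the KV cache relative to full multi-head attention.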
RoPE
Uses rotary positional embeddings in attention.
parameters: null
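RoPE rotates each (even, odd) feature pair of the query/key vectors by a position-dependent angle. A minimal sketch for one pair, with head dimension 64 and base 10000 as assumed defaults (neither appears in this record):

```python
import math

def rope_rotate(pair, pos, pair_idx, d_head=64, base=10000.0):
    # Rotate one (even, odd) feature pair by an angle that grows with
    # token position and shrinks with pair index. d_head and base are
    # assumptions; the record only states that RoPE is used.
    x, y = pair
    theta = pos * base ** (-2.0 * pair_idx / d_head)
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c)
```

The rotation is norm-preserving, and relative position falls out of the dot product between rotated queries and keys.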
Quantization
STE QAT int6
bits: 6
scope: all block weights
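A minimal sketch of the fake-quantization forward pass, assuming symmetric per-tensor scaling (the record does not specify the scaling granularity). Under the straight-through estimator (STE), the backward pass treats the round-trip as identity, so gradients keep updating the full-precision weights:

```python
def fake_quant_int6(weights, bits=6):
    # Snap each weight to the int6 grid, then dequantize back to float.
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q]
```

At export time the int6 codes themselves are stored, which is what makes the 15.4 MB artifact possible before zstd even runs.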
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
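A sketch of the momentum warmup implied by other_params, assuming linear interpolation (the record gives only the endpoints 0.92 → 0.99 and the 1500-step count):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linearly ramp momentum from `start` to `end`, then hold.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```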
AdamW
weight_decay: 0.04
momentum: null
other_params: null
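Both optimizers apply weight_decay=0.04 in decoupled (AdamW-style) form: the weight is shrunk directly rather than having the decay folded into the gradient. A one-line sketch of a single update, where `lr` and `grad_update` are illustrative placeholders:

```python
def decoupled_wd_step(w, grad_update, lr=0.02, weight_decay=0.04):
    # Decoupled weight decay: multiplicative shrink applied to the weight
    # itself, separate from the optimizer's gradient-based update.
    # lr=0.02 is a placeholder; the record does not state learning rates.
    return w * (1.0 - lr * weight_decay) - lr * grad_update
```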
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
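Sliding-window evaluation scores tokens with more left context by advancing the context window a small stride at a time instead of jumping a full window. A sketch of the window placement, assuming a window of 1024 tokens (the record specifies only stride=64):

```python
def sliding_window_starts(n_tokens, window=1024, stride=64):
    # Window start offsets: advance by `stride` until the window reaches
    # the end of the token stream. Typically only the last `stride`
    # positions of each window are scored, so every token gets close to
    # `window` tokens of context.
    return list(range(0, max(n_tokens - window, 0) + 1, stride))
```

The cost is roughly window/stride forward passes per window of text, traded for a lower (better) val_bpb.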
Regularization
weight decay
parameters: {"muon_weight_decay":0.04,"adam_weight_decay":0.04}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
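A sketch of the implied trapezoidal schedule: linear warmup over 1500 steps, a flat plateau, then a linear warmdown to zero over the final 3000 iterations. The total step count and the trapezoid shape are assumptions; the record gives only warmup_steps and warmdown_iters:

```python
def lr_scale(step, total_steps=20000, warmup_steps=1500, warmdown_iters=3000):
    # Multiplier on the base learning rate at a given step.
    # total_steps=20000 is a placeholder; the record does not state it.
    if step < warmup_steps:
        return step / warmup_steps                          # linear warmup
    if step > total_steps - warmdown_iters:
        return max((total_steps - step) / warmdown_iters, 0.0)  # warmdown
    return 1.0                                              # plateau
```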

Novel Contributions

  • 11-layer transformer with 3x MLP expansion
  • Int6 quantization-aware training with STE fake quantization
  • Decoupled weight decay of 0.04 on both Muon and AdamW
  • FP16 tied embedding export to preserve embedding/head quality
  • zstd-22 compression to fit the larger model under the 16 MB limit
  • Sliding window evaluation with stride 64 for improved val_bpb
  • Higher Muon momentum with warmup from 0.92 to 0.99