PR #173

open

Record submission: Int6 + MLP 3x + FlashAttention 3 + NorMuon, val_bpb = 1.1532

by tamoghnokandar
val_bpb: 1.1532
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15.96 MB

Training Techniques

Quantization
int6
bits: 6
scope: weight matrices, with per-row scaling; the tied embedding and the last two layers' c_k.weight are kept in fp16
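The per-row int6 scheme above can be sketched as follows. This is a minimal illustration assuming symmetric quantization (each row's max magnitude mapped to the int6 range); the submission's actual quantizer may differ in rounding or range conventions.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6 quantization sketch (an assumption, not the
    submission's exact code). Each row gets its own scale; the symmetric
    int6 range used here is [-31, 31]."""
    # Per-row scale maps the row's max magnitude onto the int6 range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
max_err = np.abs(w - dequantize(q, s)).max()  # bounded by half a scale step
```

Per-row scaling keeps the quantization error proportional to each row's own magnitude, which is why sensitive tensors (the tied embedding, the last layers' c_k.weight) are left in fp16 instead.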
Architecture
MLP3x
Increased MLP hidden size from 1024 to 1536 (3x expansion).
parameters: {"hidden_size":1536,"base_hidden_size":1024}
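Back-of-the-envelope arithmetic for what the 3x expansion costs, assuming a plain two-projection MLP (up + down), dim=512, and 9 layers as listed under "KV head count". The actual block may use gating, so these are illustrative figures only.

```python
# Illustrative parameter-count arithmetic (assumes a two-matrix MLP;
# the submission's block structure may differ).
dim, layers = 512, 9

def mlp_params(hidden: int) -> int:
    return 2 * dim * hidden * layers  # up-projection + down-projection

base = mlp_params(1024)       # baseline hidden size
expanded = mlp_params(1536)   # 1.5x wider hidden
extra = expanded - base       # parameters added by the expansion
int6_bytes = extra * 6 / 8    # int6 storage cost of the added weights
```

Under these assumptions the expansion adds roughly 4.7M parameters, but at 6 bits each that is only about 3.5 MB of artifact, which is how the wider MLP stays inside the size budget.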
tied embeddings
Kept tied token embedding in fp16 for sensitivity reasons.
parameters: null
KV head count
The model uses 4 KV heads with 8 attention heads (grouped-query attention).
parameters: {"heads":8,"kv_heads":4,"layers":9,"dim":512}
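With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A minimal sketch of the head-sharing step, assuming the common repeat-KV implementation (the submission's kernel path via FlashAttention may handle this natively):

```python
import numpy as np

# Grouped-query attention head sharing: 8 query heads read from 4 KV
# heads, so each KV head is repeated to serve heads // kv_heads queries.
heads, kv_heads, dim = 8, 4, 512
head_dim = dim // heads  # 64

T = 16  # illustrative sequence length
k = np.random.randn(kv_heads, T, head_dim)

def repeat_kv(k: np.ndarray, n_rep: int) -> np.ndarray:
    # (kv_heads, T, head_dim) -> (kv_heads * n_rep, T, head_dim)
    return np.repeat(k, n_rep, axis=0)

k_full = repeat_kv(k, heads // kv_heads)
```

Halving the KV head count halves the K/V projection parameters and KV-cache size while keeping the full set of query heads.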
Optimizer
NorMuon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
Evaluation
sliding window eval
parameters: {"stride":256,"eval_seq_len":2048}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Regularization
gradient clipping
parameters: {"grad_clip_norm":0.3}
Other
other
FlashAttention 3 used for the attention kernel to improve training/runtime on H100s.
parameters: null

Novel Contributions

  • Replaced Muon with NorMuon for optimizer updates.
  • Switched the attention path to FlashAttention 3.
  • Used int6 post-training quantization with per-row scaling to fit a larger MLP.
  • Expanded the MLP hidden size from 1024 to 1536 while staying within the artifact budget.
  • Validated the submission across three seeds (7, 42, 1337) with sliding-window evaluation.