PR #204 (open)

Add record: INT6 10L SWA NorMuon, val_bpb=1.2320

  • val_bpb: 1.2320
  • Architecture: GPT
  • Optimizer: NorMuon
  • Artifact Size: 14.2 MB

Training Techniques

Quantization
  • int6 (bits: 6, scope: all model weights)
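The record only states 6-bit quantization over all weights; a minimal sketch, assuming a symmetric per-tensor scheme (the actual scale granularity and rounding are not given in the PR):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization to the range [-31, 31].

    Assumption: the PR only says "int6, all model weights"; the
    symmetric per-tensor scale used here is illustrative.
    """
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)
```

With this scheme the worst-case per-weight error is half a quantization step (scale / 2).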
Architecture
  • tied embeddings: uses tied input/output embeddings (parameters: null)
  • KV head count: grouped-query attention with fewer KV heads than attention heads
    (parameters: {"layers":10,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_hidden":1088})
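A minimal sketch of the grouped-query attention score computation with this record's dims (model_dim=512, num_heads=8, num_kv_heads=4); everything beyond those numbers is illustrative:

```python
import numpy as np

# Dims from this record: model_dim=512, num_heads=8, num_kv_heads=4.
model_dim, num_heads, num_kv_heads, seq = 512, 8, 4, 16
head_dim = model_dim // num_heads  # 64

def gqa_scores(q, k):
    """Grouped-query attention: each of the 4 KV heads is shared by
    8 // 4 = 2 query heads, halving the K/V projection and cache size."""
    groups = q.shape[0] // k.shape[0]        # query heads per KV head
    k_rep = np.repeat(k, groups, axis=0)     # (num_heads, seq, head_dim)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

q = np.random.randn(num_heads, seq, head_dim)
k = np.random.randn(num_kv_heads, seq, head_dim)
scores = gqa_scores(q, k)
```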
Optimizer
  • NorMuon (weight_decay: 0.02, momentum: null, other_params: {"beta2":0.95})
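The contributions list calls out decoupled weight decay (weight_decay: 0.02 here). A minimal sketch of the decoupled form, where the decay is applied directly to the weights rather than folded into the gradient; the `update` argument stands in for whatever direction NorMuon computes, since the optimizer's internals are not specified in this record:

```python
import numpy as np

def step_with_decoupled_wd(param, update, lr, weight_decay=0.02):
    """Decoupled weight decay: shrink the weights directly, independent
    of the gradient-based step. `update` is a placeholder for the
    NorMuon update direction (not specified in this record)."""
    param = param - lr * weight_decay * param  # decay, separate from the step
    return param - lr * update                 # optimizer step

p = np.ones(4)
p_next = step_with_decoupled_wd(p, np.zeros(4), lr=0.1)
```

With a zero update, one step shrinks each weight by lr * weight_decay, i.e. 1.0 → 0.998 here.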
Weight Averaging
  • SWA (parameters: {"snapshots":50,"every_steps":200})
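A minimal sketch of the averaging itself: an equal-weight running mean over 50 snapshots taken every 200 steps, per the parameters above. The snapshot-dict layout is illustrative:

```python
import numpy as np

class SWA:
    """Equal-weight running average of parameter snapshots
    (50 snapshots, every 200 steps, per this record)."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.astype(np.float64).copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n  # incremental mean

swa = SWA()
for step in range(200, 10001, 200):          # 50 snapshots, every 200 steps
    params = {"w": np.full(3, float(step))}  # stand-in for model weights
    swa.update(params)
```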
Compression
  • zlib (level: 9)
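A sketch of the compression step: the record only names zlib at level 9, so storing the int6 codes one per byte before compressing is an assumption about the artifact layout:

```python
import zlib
import numpy as np

# Quantized int6 codes (one per int8 byte) compressed with zlib level 9.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.normal(0, 10, (512, 512))), -31, 31).astype(np.int8)

blob = zlib.compress(q.tobytes(), level=9)
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(512, 512)
```

Since the 6-bit codes use only 63 of 256 byte values, zlib's entropy coding recovers part of the 2 wasted bits per byte.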
Evaluation
  • sliding window eval (parameters: {"stride":64,"batch_seqs":32,"context_length":4096})
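A minimal sketch of the window arithmetic for stride-64 evaluation at context 4096 (the parameters above): each window advances by the stride and only the tokens new to that window are scored, so scored tokens see (near-)full context. Batching (batch_seqs: 32) and the model forward pass are left out:

```python
def sliding_windows(n_tokens, context_length=4096, stride=64):
    """(window_start, score_start, score_end) spans for sliding-window
    eval with stride 64 and context 4096, per this record.
    Assumes n_tokens is a multiple of stride."""
    spans = []
    prev_end = 0
    first_end = min(context_length, n_tokens)
    for end in range(first_end, n_tokens + 1, stride):
        start = max(0, end - context_length)
        spans.append((start, prev_end, end))
        prev_end = end
    return spans

spans = sliding_windows(8192)
```

Every token is scored exactly once, but each window after the first re-runs 4096 tokens to score only 64 new ones, which is why stride-based eval is slow but gives better bpb than chunked eval.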
Sequence Length
  • sequence_length (train_length: 2048, eval_length: 4096)
LR Schedule
  • warmdown (parameters: {"warmdown_iters":20000})
Regularization
  • weight decay (parameters: {"value":0.02})
Other
  • Aggressive warmdown from step 0 to encourage tighter weight distributions for quantization
    (parameters: {"warmdown_iters":20000})
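The schedule described above can be sketched as a linear decay starting at step 0; the linear shape and the base LR are assumptions, since the record only specifies warmdown_iters=20000:

```python
def warmdown_lr(step, base_lr, warmdown_iters=20000):
    """Warmdown beginning at step 0, per this record's note: LR decays
    to 0 over warmdown_iters. The linear shape and base_lr are
    illustrative; the record only gives warmdown_iters."""
    frac = min(step, warmdown_iters) / warmdown_iters
    return base_lr * (1.0 - frac)

lrs = [warmdown_lr(s, base_lr=0.01) for s in (0, 10000, 20000)]
```

Decaying from the very first step keeps late-training weight updates small, which is consistent with the stated goal of a tighter weight distribution before int6 quantization.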

Novel Contributions

  • INT6 quantization enabling a larger 10-layer architecture within the 16MB budget
  • Stochastic Weight Averaging with 50 snapshots before quantization
  • NorMuon optimizer with decoupled weight decay
  • Aggressive warmdown schedule starting from step 0
  • Use of NTK RoPE evaluation at 4096 context, though it degraded post-quant performance