PR #39

Record (closed)

Record: 10L Mixed Precision: val_bpb=1.2147 (10 layers + int6 middle layers)

by nanlliu
val_bpb: 1.2139
Architecture: Transformer
Optimizer: Muon/Adam
Artifact Size: 15.93 MB

Training Techniques

  • Quantization: mixed int6/int8
      bits: 6
      scope: middle layers 3-6 int6; first/last 3 layers int8
  • Architecture
      depth / layer count: increased the Transformer depth from 9 layers to 10 layers.
        parameters: {"layers":10}
      tied embeddings: uses tied input/output embeddings.
        parameters: null
      KV head count: Transformer configuration includes 4 KV heads.
        parameters: {"kv_heads":4}
  • Optimizer: Muon/Adam
      weight_decay: null
      momentum: null
      other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03}
  • Compression: zlib
      level: null
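The quantization and compression entries above can be illustrated with a minimal sketch: symmetric per-tensor quantization at a chosen bit width, int6 for the middle layers and int8 elsewhere, followed by zlib on the raw bytes. This is not the PR's actual code; the function names, the per-tensor scaling scheme, and storing int6 values in int8 containers (relying on zlib to absorb the unused bits) are all assumptions for illustration.

```python
import zlib
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Symmetric per-tensor quantization of float weights to `bits` bits.

    Hypothetical helper, not the PR's implementation. Returns integer
    codes (held in an int8 container) plus the float scale needed to
    dequantize: w ≈ q * scale.
    """
    qmax = 2 ** (bits - 1) - 1                # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def compress_layers(layers, int6_layers=range(3, 7)):
    """Quantize middle layers to int6 and the rest to int8, then zlib each.

    `int6_layers=range(3, 7)` mirrors the record's "middle layers 3-6 int6"
    scope; the (blob, scale, bits) tuple format is an assumption.
    """
    blobs = []
    for i, w in enumerate(layers):
        bits = 6 if i in int6_layers else 8
        q, scale = quantize(w, bits)
        blobs.append((zlib.compress(q.tobytes()), scale, bits))
    return blobs
```

The rounding step bounds the per-weight error by half a quantization step, which is why dropping the middle layers from 8 to 6 bits costs little validation bpb while saving enough bytes to fund the tenth layer.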

Novel Contributions

  • Mixed-precision compression using int8 for early/late layers and int6 for middle layers to fit under the 16MB limit.
  • Increased model depth from 9 to 10 transformer layers while staying within the artifact budget.
  • Lowered learning rates substantially from the default settings to improve post-quantization validation performance.
  • Demonstrated multi-seed robustness with all five seeds beating the prior benchmark and achieving p < 0.001.
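The PR does not state how the p < 0.001 figure was obtained; a one-sample t-test of the per-seed val_bpb values against the prior benchmark is one plausible way to compute it. The sketch below shows only the t-statistic; the seed values and benchmark number in the usage lines are hypothetical, not the record's actual measurements.

```python
import statistics

def one_sample_t(values, benchmark):
    """t-statistic for H0: mean(values) == benchmark (one-sample t-test).

    A negative t with a lower val_bpb mean indicates improvement over
    the benchmark; the p-value would come from the t-distribution with
    len(values) - 1 degrees of freedom.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)          # sample standard deviation
    return (mean - benchmark) / (sd / n ** 0.5)

# Hypothetical per-seed results; the real run's per-seed numbers are not listed here.
seeds_bpb = [1.2139, 1.2142, 1.2145, 1.2140, 1.2143]
prior_record = 1.2200  # hypothetical prior-benchmark value
t = one_sample_t(seeds_bpb, prior_record)  # strongly negative => improvement
```

With only five seeds, a nonparametric sign test tops out at p = 2^-5 ≈ 0.031, so a parametric test like this is needed to reach p < 0.001.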