PR #469 (closed)
Non-record: 27M params at Int5 QAT / train larger, quantize harder (val_bpb=1.1418)
by cmcdnd
val_bpb: 1.1418
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.7 MB
Training Techniques
Quantization
int5
bits: 5
scope: MLP and attention weights
QAT
bits: 5
scope: all
Architecture
Partial RoPE
Applies rotary position embeddings to only 16 of the 64 head dimensions
parameters: {"dimensions":"16/64"}
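A sketch of the partial RoPE above, rotating only the first 16 of 64 head dimensions. Which 16 dims are rotated and the frequency base are assumptions:

```python
import torch

def partial_rope(x: torch.Tensor, rot_dims: int = 16, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to the first `rot_dims` of the head
    dimension, leaving the remaining dims untouched.
    x: (batch, seq, heads, head_dim), head_dim >= rot_dims, rot_dims even."""
    b, t, h, d = x.shape
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs   # (t, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```

The untouched 48 dims keep position-independent content channels, which is the usual motivation for partial rotation.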
XSA
Applies XSA in the last 4 layers
parameters: {"layers":4}
SmearGate
Uses SmearGate activation/module
parameters: null
BigramHash
Adds BigramHash feature module
parameters: {"size":4096,"dim":128}
MLP3x
Uses 3x MLP expansion
parameters: {"hidden":1728}
KV head count
Uses grouped-query attention with fewer KV heads than attention heads
parameters: {"heads":9,"kv_heads":3}
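The grouped-query attention above (9 query heads sharing 3 KV heads) amounts to expanding each KV head across its group of 3 query heads before the attention product. A minimal sketch, with a (batch, seq, heads, head_dim) layout assumed:

```python
import torch

def expand_kv(kv: torch.Tensor, n_heads: int = 9) -> torch.Tensor:
    """Grouped-query attention KV expansion: each KV head is shared by a
    contiguous group of query heads (9 / 3 = groups of 3 here).
    kv: (batch, seq, kv_heads, head_dim) -> (batch, seq, n_heads, head_dim)."""
    kv_heads = kv.shape[2]
    assert n_heads % kv_heads == 0, "query heads must be a multiple of KV heads"
    return kv.repeat_interleave(n_heads // kv_heads, dim=2)
```

Only the 3 KV heads are stored and trained; the expansion is a view-level broadcast, which is where the parameter and KV-cache savings come from.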
U-Net skips
Uses U-Net style skip connections
parameters: null
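The exact wiring of the U-Net skips is not specified in the card; a common pattern is to save activations from the first half of the layer stack and add them back at the mirrored layers of the second half. A minimal sketch under that assumption:

```python
def unet_forward(x, layers):
    """U-Net style skips over a layer stack: inputs to the first half of
    the layers are pushed onto a stack and added back (LIFO, so mirrored)
    before each layer in the second half."""
    n = len(layers)
    skips = []
    for i, layer in enumerate(layers):
        if i < n // 2:
            skips.append(x)        # save pre-layer activation
        elif skips:
            x = x + skips.pop()    # add mirrored early activation
        x = layer(x)
    return x
```

The same function works on tensors or plain numbers, since it only uses `+`.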
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
Weight Averaging
SWA
parameters: null
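SWA here presumably averages parameter snapshots from late in training; the averaging window and cadence are not specified in the card. A minimal running-mean sketch:

```python
import torch

class SWAAverager:
    """Stochastic weight averaging as a running mean over the parameter
    snapshots fed in: avg_n = avg_{n-1} + (w_n - avg_{n-1}) / n."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, state_dict):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.detach().clone().float() for k, v in state_dict.items()}
        else:
            for k, v in state_dict.items():
                self.avg[k] += (v.detach().float() - self.avg[k]) / self.n
```

The averaged weights would replace the final checkpoint before quantization and export.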
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
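Sliding-window eval with stride 64 typically scores each token with near-maximal left context: a full-length window advances 64 tokens at a time and only the newly covered tokens are scored. A sketch of the window bookkeeping; tying the window length to the 2048 train length is an assumption:

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (begin, end, n_scored) spans for sliding-window evaluation.
    Each span is a model forward pass over tokens [begin, end); only the
    last n_scored tokens (those not covered by a previous span) count
    toward the loss, so every token is scored exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Averaging the summed per-token losses over `n_tokens` then gives the reported bits-per-byte.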
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
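The warmdown schedule is presumably the trapezoidal "constant, then linear decay to zero" shape common in speedrun-style training; a sketch returning the LR multiplier, with any warmup phase omitted:

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3500) -> float:
    """Warmdown LR multiplier: 1.0 until the final warmdown_steps, then a
    linear ramp down to 0.0 at total_steps."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```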
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay
parameters: {"weight_decay":0.04}
Other
other
Early activation of int5 STE fake-quantization when lr_scale < 0.50, giving about 1,700 adaptation steps
parameters: {"threshold":0.5,"adaptation_steps":1700}
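Under a linear warmdown, the lr_scale < 0.50 trigger fires 0.5 x 3500 = 1750 steps before the end of training, consistent with the roughly 1,700 adaptation steps reported. A sketch of that gating logic, with the schedule shape assumed as above:

```python
def qat_active(step: int, total_steps: int,
               warmdown_steps: int = 3500, threshold: float = 0.5) -> bool:
    """True once the warmdown LR scale has dropped below `threshold`,
    i.e. int5 fake quantization is switched on for the rest of training."""
    remaining = total_steps - step
    return min(1.0, remaining / warmdown_steps) < threshold
```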
Novel Contributions
- Train a larger 27M-parameter model at the same artifact budget by using more aggressive int5 quantization instead of int6.
- Activate QAT much earlier (threshold 0.50) to allow substantially more adaptation time for the coarser 32-level quantization grid.
- Demonstrate that training larger and quantizing harder can outperform the standard smaller int6 approach at similar artifact size.