PR #478
New SOTA: 1.12676 BPB - 11L XSA-all(11) + GPTQ-lite + EMA + Late QAT
by gowtham0992
val_bpb
1.1268
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.7 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to all 11 layers instead of only the last few layers.
parameters: {"layers":11}
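The PR does not spell out the XSA mechanism. One plausible reading of "Exclusive Self Attention" (an assumption on my part) is causal attention with the diagonal masked, so each token attends only to strictly earlier tokens and never to itself:

```python
import numpy as np

def exclusive_self_attention(q, k, v):
    """Single-head attention with the diagonal masked out: each token
    attends only to strictly earlier tokens, never to itself. This
    reading of 'Exclusive Self Attention' is an assumption; the PR does
    not define the mechanism. q, k: (seq, d); v: (seq, dv).
    The first token has nothing to attend to and outputs a zero vector."""
    seq, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((seq, seq)), k=-1)   # strictly below diagonal
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True)) * mask
    denom = weights.sum(axis=-1, keepdims=True)
    weights = np.divide(weights, denom,
                        out=np.zeros_like(weights), where=denom > 0)
    return weights @ v
```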
Partial RoPE
Rotary positional embeddings applied to only a subset of the head dimensions, with NTK-aware scaling.
parameters: {"dimensions":16,"total_dimensions":64}
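A minimal sketch of partial RoPE under these parameters: only 16 of the 64 head dimensions are rotated and the rest pass through unchanged. The NTK-aware scaling exponent below is the commonly used formula, not taken from the PR, and `ntk_alpha` defaults to a no-op:

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0, ntk_alpha=1.0):
    """Rotate only the first `rot_dims` of `head_dim` dimensions; the rest
    pass through untouched. ntk_alpha > 1 inflates the frequency base
    (NTK-aware scaling) to stretch the usable context; the exponent is the
    common NTK-aware formula, an assumption here.
    x: (seq, head_dim) array; positions: (seq,) ints."""
    positions = np.asarray(positions, dtype=np.float64)
    half = rot_dims // 2
    scaled_base = base * ntk_alpha ** (rot_dims / (rot_dims - 2))
    inv_freq = scaled_base ** (-2.0 * np.arange(half) / rot_dims)
    ang = positions[:, None] * inv_freq[None, :]            # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    return np.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos, x[:, rot_dims:]],
        axis=-1)
```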
SmearGate
Additional gating mechanism used in the architecture.
parameters: null
BigramHash
Hash-based bigram feature module with learned embeddings.
parameters: {"buckets":2048,"dimension":128}
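A sketch of the bigram-hash idea with the PR's sizes (2048 buckets, 128-dim embeddings): each (previous, current) token pair is hashed into a bucket and a learned embedding is looked up. The hash constant and the padding value for position 0 are illustrative, not from the PR:

```python
import numpy as np

def bigram_hash_features(tokens, table, buckets=2048):
    """Map each (prev, cur) token bigram to a bucket with a cheap
    multiplicative hash, then look up the bucket's learned embedding.
    `table` is the learned (buckets, dim) matrix; the hash constant and
    the position-0 padding are illustrative choices."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))   # pad position 0
    h = (prev * 1000003 + tokens) % buckets
    return table[h]                             # (seq, dim) features
```

The resulting features would typically be added to the token embeddings at the model input; how the PR combines them is not stated.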
tied embeddings
Input and output embeddings are tied.
parameters: null
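Weight tying is the standard trick of reusing one matrix for both the input embedding lookup and the output projection, so no separate unembedding weights are stored in the artifact. A minimal sketch (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 256, 64
emb = rng.normal(0.0, 0.02, size=(vocab, d_model))  # the single shared matrix

def embed(token_ids):
    """Input side: rows of the shared matrix."""
    return emb[np.asarray(token_ids)]

def output_logits(hidden):
    """Output side: project against the same matrix (tied)."""
    return hidden @ emb.T
```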
Quantization
GPTQ-lite
bits: 6
scope: all large weights
QAT
bits: 6
scope: all
int8
bits: 8
scope: embeddings
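The contributions list mentions an "optimal clip percentile search" for GPTQ-lite. A heavily simplified sketch of that idea: symmetric fake-quantization to 6 bits, trying a small grid of clip percentiles and keeping the one with the lowest reconstruction MSE. Real GPTQ's Hessian-based error compensation is omitted, and the percentile grid is an illustrative assumption:

```python
import numpy as np

def quantize_with_clip_search(w, bits=6,
                              percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Symmetric fake-quantization to `bits` bits, searching a grid of
    clip percentiles for the lowest-MSE reconstruction. A simplification
    of the PR's GPTQ-lite; the grid values are illustrative."""
    qmax = 2 ** (bits - 1) - 1
    best = None
    for p in percentiles:
        clip = np.percentile(np.abs(w), p)
        if clip == 0:
            continue
        scale = clip / qmax
        deq = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        mse = float(np.mean((w - deq) ** 2))
        if best is None or mse < best[0]:
            best = (mse, deq, p)
    return best[1], best[2]   # dequantized weights, chosen percentile
```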
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"start_condition":"scale<0.2"}
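The two averaging schemes above can be sketched as follows: an EMA of the weights with decay 0.997, and a uniform running average of checkpoints taken every 50 steps once the LR-schedule scale drops below 0.2. The dict-of-arrays parameter representation is an illustrative choice:

```python
import numpy as np

def ema_update(ema, params, decay=0.997):
    """One EMA step per parameter: ema <- decay*ema + (1-decay)*params."""
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]

class SWA:
    """Uniform running average of checkpoints, collected every
    `frequency` steps once the LR scale falls below `start_scale`."""
    def __init__(self, frequency=50, start_scale=0.2):
        self.frequency, self.start_scale = frequency, start_scale
        self.avg, self.count = {}, 0

    def maybe_collect(self, step, lr_scale, params):
        if lr_scale >= self.start_scale or step % self.frequency != 0:
            return
        self.count += 1
        for k, v in params.items():
            prev = self.avg.get(k, 0.0)
            self.avg[k] = prev + (v - prev) / self.count  # running mean
```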
Compression
zstd
level: 22
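With the zstd CLI, levels above 19 require the `--ultra` flag; a round-trip at the PR's level 22 looks like this (file names illustrative):

```shell
# Compress the raw-binary checkpoint at zstd's maximum ratio (level 22).
zstd --ultra -22 model.bin -o model.bin.zst
# Decompression is lossless.
zstd -d model.bin.zst -o model_restored.bin
```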
Evaluation
sliding window eval
parameters: {"stride":64}
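Sliding-window evaluation advances the context window by `stride` tokens and scores only the newly exposed tokens, so every token is evaluated exactly once with near-full context. The window size and the `token_nll_fn(ctx)` interface (returning one NLL per token of `ctx`) are assumptions; the PR specifies only stride=64:

```python
def sliding_window_nll(token_nll_fn, tokens, window=256, stride=64):
    """Mean per-token NLL via a sliding window: each step scores only the
    `stride` newly exposed tokens, with the rest of the window as context.
    window=256 is illustrative; the PR only gives stride=64."""
    n, nlls = len(tokens), []
    for seg_end in range(stride, n + stride, stride):
        end = min(seg_end, n)
        ctx = tokens[max(0, end - window):end]
        per_tok = token_nll_fn(ctx)
        keep = end - (seg_end - stride)   # count of newly scored tokens
        nlls.extend(per_tok[-keep:])
    return sum(nlls) / len(nlls)
```

Dividing the resulting mean NLL (in nats) by ln 2 and by the bytes-per-token ratio gives BPB.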
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
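A sketch of orthogonal initialization via QR decomposition. The PR does not give its muP scaling factor; the 1/sqrt(fan_in) shrink applied to output projections below is an illustrative stand-in:

```python
import numpy as np

def ortho_init(shape, mup_output=False, rng=None):
    """Orthogonal init via QR (requires shape[0] >= shape[1]). When
    `mup_output` is set, the matrix is shrunk by 1/sqrt(fan_in) as a
    muP-style output-projection scaling; that exact factor is an
    assumption, not taken from the PR."""
    if rng is None:
        rng = np.random.default_rng()
    q, r = np.linalg.qr(rng.normal(size=shape))
    q = q * np.sign(np.diag(r))      # sign fix for a uniform distribution
    if mup_output:
        q = q / np.sqrt(shape[0])
    return q
```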
LR Schedule
warmdown
parameters: {"warmdown_iterations":3500}
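A warmdown schedule holds the LR multiplier at 1.0 and then decays it linearly to 0 over the final `warmdown_iterations` steps. This multiplier is the "LR scale" that the SWA start condition and the late-QAT trigger elsewhere in this card compare against. The sketch assumes no warmup phase, since the PR lists only the warmdown length:

```python
def warmdown_scale(step, total_steps, warmdown_iterations=3500):
    """LR multiplier: 1.0 until the final `warmdown_iterations` steps,
    then linear decay to 0 at `total_steps`. No warmup is modeled."""
    remaining = total_steps - step
    if remaining >= warmdown_iterations:
        return 1.0
    return max(remaining, 0) / warmdown_iterations
```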
Regularization
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(layer_idx+1)"}
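The scale rule multiplies each layer's LayerNorm output by 1/sqrt(layer_idx + 1), so deeper layers contribute progressively less to the residual stream. A minimal sketch (learned affine parameters omitted for brevity):

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-5):
    """LayerNorm whose output is multiplied by 1/sqrt(layer_idx + 1),
    per the PR's scale_rule. Learned gain/bias omitted for brevity."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer_idx + 1)
```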
Other
other
Late QAT with int6 STE fake-quantization when LR scale drops below 0.15.
parameters: {"lr_scale_threshold":0.15}
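The late-QAT step above can be sketched as symmetric int6 fake-quantization applied only once the LR-schedule scale drops below 0.15. In training this would run under a straight-through estimator (gradients pass through the rounding unchanged); only the forward pass is shown here:

```python
import numpy as np

def fake_quant_int6(w, qmax=31):
    """Symmetric int6 fake-quantization (forward pass only). Under STE
    training, the rounding would be transparent to gradients."""
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def late_qat_step(w, lr_scale, threshold=0.15):
    """Quantize only late in training: full precision while the
    LR scale is at or above the threshold (0.15 per the PR)."""
    return fake_quant_int6(w) if lr_scale < threshold else w
```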
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025,"warmup_momentum":"0.92->0.99 over 1500 steps"}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
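Muon's `other_params` describe a momentum warmup from 0.92 to 0.99 over the first 1500 steps. A linear ramp (the interpolation shape is assumed; the PR gives only endpoints and duration):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from 0.92 to 0.99 over the first 1500 steps,
    then hold it constant. Linear interpolation is an assumption."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```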
Novel Contributions
- XSA applied to all 11 layers
- GPTQ-lite optimal clip percentile search
- EMA with tight SWA
- Late QAT int6-all triggered at low learning-rate scale
- Raw binary serialization with zstd level 22 compression
- Removal of Backout mechanism improved compression quality
- No pruning required for int6-all fitting under the size limit