val_bpb: 1.0909
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.99 MB
Training Techniques
Quantization
GPTQ
bits: 5
scope: matrix weights
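GPTQ compensates quantization error column by column using second-order (Hessian) information; that machinery is out of scope here, but the 5-bit grid it quantizes onto can be sketched. A minimal round-to-nearest sketch with per-row scales (the per-row scaling granularity is an assumption, not from the log):

```python
import numpy as np

# A 5-bit round-to-nearest grid with per-row scales. GPTQ itself adds
# Hessian-based error compensation on top of a grid like this; only the
# grid and the dequantization step are shown here.
def quantize_rtn(W, bits=5):
    qmax = 2 ** (bits - 1) - 1                      # 15 for int5
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q * scale

W = np.random.default_rng(0).standard_normal((4, 8))
q, scale = quantize_rtn(W)
recon_err = np.abs(dequantize(q, scale) - W).max()
```

Each weight is stored in 5 bits plus a shared scale, and the worst-case reconstruction error is half a quantization step.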
Architecture
MLP
Widened the MLP from 4.0x to 4.8x of model width in one experiment, trading the storage saved by int5 quantization for extra capacity.
parameters: {"multiplier":4.8}
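The 4.8x figure is not arbitrary: MLP weight count scales linearly with the multiplier, and 4.8/4.0 = 6/5 exactly cancels the 5/6 storage ratio of int5 versus int6, so the MLP's contribution to artifact size is held fixed. A back-of-envelope check (the width `d_model` is hypothetical, not from the log):

```python
# Budget arithmetic: widening the MLP 4.0x -> 4.8x adds 20% more MLP
# weights, while int5 stores each weight in 5/6 the bits of int6; the
# two effects cancel, keeping the quantized MLP size unchanged.
d_model = 1024                        # hypothetical width, not from the log

def mlp_weights(mult, d=d_model):
    # standard two-matrix MLP: d -> mult*d -> d
    return 2 * d * mult * d

growth = mlp_weights(4.8) / mlp_weights(4.0)     # 1.2x more weights
storage_ratio = 5 / 6                            # int5 bytes/weight vs int6
net = growth * storage_ratio                     # ~1.0: same quantized size
```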
depth recurrence
Uses depth recurrence as part of the base SOTA stack.
parameters: {"layers":11,"loops":2,"virtual_layers":17}
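The parameters imply a recurring block: 17 virtual layers from 11 physical layers with 2 loops means a block of 17 - 11 = 6 layers is applied twice. Which 6 layers recur is not in the log; looping layers 3..8 as a unit below is an assumption for illustration:

```python
# Depth recurrence: unroll 11 physical layers into 17 virtual layers by
# running one contiguous 6-layer block twice. The position of the looped
# block (layers 3..8) is hypothetical.
def layer_schedule(n_layers=11, loop_start=3, loop_len=6, loops=2):
    pre = list(range(loop_start))
    block = list(range(loop_start, loop_start + loop_len))
    post = list(range(loop_start + loop_len, n_layers))
    return pre + block * loops + post

virtual = layer_schedule()   # sequence of physical layers the forward pass runs
```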
XSA
Uses the XSA attention component in the base architecture.
parameters: null
Partial RoPE
Applies RoPE to only 16 of the 64 head dimensions; the rest are left unrotated.
parameters: {"ratio":"16/64"}
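A sketch of the 16/64 split, assuming the common half-split pairing convention (the log does not specify how rotated dims are paired):

```python
import numpy as np

# Partial RoPE: rotate only the first 16 of 64 head dims; the remaining
# 48 pass through untouched. Half-split pairing is an assumption.
def partial_rope(x, pos, rot_dims=16, base=10000.0):
    rot, rest = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)     # one frequency per pair
    ang = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)

x = np.random.default_rng(0).standard_normal((5, 64))
y = partial_rope(x, np.arange(5))
```

The rotation is norm-preserving on the rotated dims, and position 0 is a no-op.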
LeakyReLU
Uses a squared LeakyReLU activation with negative slope 0.5.
parameters: {"squared":true,"negative_slope":0.5}
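A minimal sketch of the activation. Whether the sign is preserved on the negative branch is not stated in the log; this version squares the output directly:

```python
# Squared LeakyReLU with negative_slope 0.5: apply LeakyReLU, then
# square. Squaring directly maps negative inputs to small positive
# values (a sign-preserving variant is also plausible but unconfirmed).
def leaky_relu_squared(x, negative_slope=0.5):
    leaky = x if x >= 0 else negative_slope * x
    return leaky * leaky
```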
KV head count
Uses fewer KV heads (4) than query heads (8), i.e. grouped-query attention.
parameters: {"heads":8,"kv_heads":4}
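With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads, halving the KV projection weights and KV cache. The grouping convention (consecutive query heads per KV head) is the standard one but is an assumption here:

```python
# Grouped-query attention mapping: 8 query heads share 4 KV heads, so
# each KV head serves 2 consecutive query heads.
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    return q_head // (n_heads // n_kv_heads)

groups = [kv_head_for(h) for h in range(8)]
```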
Weight Averaging
EMA
parameters: {"decay":0.9965}
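The EMA keeps a shadow copy of the weights updated once per step; a decay of 0.9965 averages over roughly 1/(1 - 0.9965) ≈ 286 recent steps. A minimal sketch over a dict of parameters:

```python
# One EMA step over the weights: avg <- decay*avg + (1-decay)*current.
# At decay 0.9965 the effective averaging horizon is ~286 steps.
def ema_update(avg, current, decay=0.9965):
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}

avg = ema_update({"w": 0.0}, {"w": 1.0})
```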
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
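Muon's core step orthogonalizes the momentum-averaged gradient of each matrix parameter via Newton-Schulz iteration before applying it; what the MuonEq-R variant changes is not in the log. Production Muon uses a tuned quintic polynomial, but the simple cubic iteration below converges to the same orthogonal factor, just more slowly:

```python
import numpy as np

# Simplified Muon orthogonalization: Newton-Schulz iteration driving all
# singular values of the update matrix toward 1. Normalizing by the
# Frobenius norm bounds the spectral norm below 1, which guarantees
# convergence of the cubic X <- 1.5*X - 0.5*X X^T X.
def orthogonalize(G, steps=30):
    X = G / np.linalg.norm(G)           # spectral norm <= Frobenius norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.default_rng(0).standard_normal((4, 6))
O = orthogonalize(G)
```

The result satisfies O @ O.T ≈ I, so the weight update has uniform scale across directions regardless of the raw gradient's conditioning.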
Test-Time Training
score-first TTT
parameters: null
Evaluation
sliding window eval
parameters: null
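The log gives no window or stride, but the usual scheme scores a long token stream in overlapping windows, counting loss only on tokens past the overlap so every scored token sees near-full left context. A sketch of the span scheduling with illustrative sizes:

```python
# Sliding-window evaluation spans: each (start, end, score_from) tuple
# means "run the model on tokens [start, end) but count loss only from
# score_from onward". Window and stride are illustrative, not from the log.
def eval_spans(n_tokens, window=1024, stride=512):
    spans, start = [], 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else min(start + window - stride, end)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = eval_spans(2000)
```

Scored regions tile the stream exactly once, so summed loss over scored tokens yields an unbiased val_bpb.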
Regularization
skip gates
parameters: null
Novel Contributions
- Clean ablation showing int5 GPTQ underperforms int6 at the same artifact size
- Measured the int5 quantization penalty at roughly 2x the int6 penalty
- Tested whether the parameter budget freed by int5 could fund a wider (4.8x) MLP
- Showed the wider MLP slows training enough that the extra capacity does not recoup the quantization penalty
- Concluded that int5 GPTQ is not viable for this competition under the current setup