PR #1646

open

Non-record: Int5 GPTQ + Wider MLP

by sergeevii123
val_bpb
1.0909
Architecture
Transformer
Optimizer
Muon
Artifact Size
13.99 MB

Training Techniques

Quantization
GPTQ
bits: 5
scope: matrix weights
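GPTQ quantizes weights column-by-column and compensates rounding error using second-order (Hessian) information; as a minimal stand-in for the int5 setting used here, a symmetric round-to-nearest 5-bit quantizer with per-row scales (the scale granularity is an assumption) looks like:

```python
import numpy as np

def quantize_int5_rtn(w):
    """Round-to-nearest symmetric 5-bit quantization with per-row scales.
    (GPTQ additionally propagates each column's rounding error into the
    not-yet-quantized columns via Hessian information; omitted here.)"""
    qmax = 2 ** (5 - 1) - 1                                # int5: [-16, 15]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax    # per-row scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int5_rtn(w)
w_hat = dequantize(q, s)
```

Round-to-nearest bounds the per-element error by half a quantization step, which is what makes the int5-vs-int6 gap measurable in the ablation.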
Architecture
MLP
Widened the MLP from 4.0x to 4.8x in one experiment to trade quantization savings for extra capacity.
parameters: {"multiplier":4.8}
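The 4.8x multiplier appears chosen so that the extra width exactly spends the bits saved by dropping from int6 to int5, since 4.8 = 4.0 × (6/5); a back-of-the-envelope check (the model dimension below is a hypothetical placeholder, not the PR's actual config):

```python
# d_model is a hypothetical placeholder; the PR's real dims are not stated here.
d_model = 1024

def mlp_bits(multiplier, bits):
    # up-projection + down-projection weight matrices, quantized
    params = 2 * d_model * (multiplier * d_model)
    return params * bits

base = mlp_bits(4.0, 6)   # int6 baseline MLP
wide = mlp_bits(4.8, 5)   # int5 + wider MLP
ratio = wide / base       # ~1.0: same artifact budget
```

So the experiment holds artifact size roughly constant and asks whether the extra capacity beats the extra quantization noise.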
depth recurrence
Uses depth recurrence as part of the base SOTA stack.
parameters: {"layers":11,"loops":2,"virtual_layers":17}
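With 11 physical layers, 2 loops, and 17 virtual layers, 6 of the layers must be traversed twice (17 − 11 = 6); which layers repeat is not stated in the card, so the sketch below just loops a middle block as an illustration:

```python
def layer_schedule(n_layers=11, n_reused=6, loop_start=3):
    """Build a virtual-layer schedule in which a middle block runs twice.
    loop_start and the choice of repeated layers are assumptions; only the
    counts (11 physical, 17 virtual) come from the card."""
    loop = list(range(loop_start, loop_start + n_reused))
    return (list(range(0, loop_start + n_reused))   # layers up to end of loop
            + loop                                   # second pass over the loop
            + list(range(loop_start + n_reused, n_layers)))

sched = layer_schedule()  # 17 entries drawn from 11 physical layers
```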
XSA
Uses XSA attention component in the base architecture.
parameters: null
Partial RoPE
Applies RoPE only to part of the dimensions.
parameters: {"ratio":"16/64"}
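A ratio of 16/64 means rotary position embedding touches only the first 16 of 64 head dimensions (8 frequency pairs), leaving the remaining 48 untouched; a sketch, assuming the standard split-half pairing and that the rotated dims come first:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims of the last axis; pass the rest through.
    x: (..., head_dim) vector at position `pos`. Pairing convention and
    frequency base are assumptions (standard RoPE defaults)."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)        # 8 frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rot_dims]    # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)

v = np.ones(64)
out = partial_rope(v, pos=5)
```

The rotation preserves the norm of each rotated pair, so only phase information is injected into those 16 dims.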
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"squared":true,"negative_slope":0.5}
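The card does not spell out the exact composition; a plausible elementwise reading (an assumption) is the square of a LeakyReLU with slope 0.5:

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Squared LeakyReLU, read here as square(leaky_relu(x)).
    Whether the negative branch keeps its sign is not specified in the
    card; the plain square used here is an assumption."""
    y = x if x > 0 else negative_slope * x
    return y * y

# leaky_relu_squared(2.0) -> 4.0; leaky_relu_squared(-2.0) -> 1.0
```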
KV head count
Uses fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
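With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads (grouped-query attention), halving the KV projection parameters; a sketch of the head expansion at attention time:

```python
import numpy as np

def expand_kv(kv, n_heads=8, n_kv_heads=4):
    """Repeat each KV head so every query head has a K/V to attend with.
    kv: (n_kv_heads, seq, head_dim) -> (n_heads, seq, head_dim)."""
    group = n_heads // n_kv_heads    # 2 query heads per KV head
    return np.repeat(kv, group, axis=0)

kv = np.arange(4 * 3 * 2, dtype=np.float32).reshape(4, 3, 2)
k = expand_kv(kv)                    # heads 0 and 1 now share kv head 0, etc.
```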
Weight Averaging
EMA
parameters: {"decay":0.9965}
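Weight EMA with decay 0.9965 keeps an exponentially smoothed copy of the weights for evaluation; the per-step update is:

```python
def ema_update(ema, w, decay=0.9965):
    """One EMA step: ema <- decay * ema + (1 - decay) * w, elementwise."""
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

# With decay 0.9965, the effective averaging window is roughly
# 1 / (1 - 0.9965) ~ 286 steps.
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])
```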
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R"}
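Muon orthogonalizes each 2D momentum update with a Newton-Schulz iteration before applying it; the specifics of the MuonEq-R variant are not given in this card, so the sketch below shows the quintic Newton-Schulz step from the stock Muon reference recipe (coefficients 3.4445, -4.7750, 2.0315):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G, as in stock Muon: Frobenius-normalize,
    then iterate X <- aX + (bA + cA^2)X with A = X X^T, which drives all
    singular values toward 1. MuonEq-R's modifications are not shown."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:                      # keep A = X X^T the small square
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 16))
O = newton_schulz(G)                    # singular values of O cluster near 1
```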
Test-Time Training
score-first TTT
parameters: null
Evaluation
sliding window eval
parameters: null
Regularization
skip gates
parameters: null

Novel Contributions

  • Clean ablation showing int5 GPTQ is worse than int6 with the same model size
  • Demonstrated that the int5 quantization gap is roughly 2x that of int6
  • Tested whether freeing parameter budget with int5 could support a wider MLP
  • Showed that a wider MLP slows training enough that capacity gains do not offset the quantization penalty
  • Documented that int5 GPTQ is not viable for this competition under the current setup