PR #576

closed

Record: Train Larger, Quantize Harder - 33.6M params + int5 GPTQ (val_bpb: 1.1164)

val_bpb: 1.1164
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.6MB

Training Techniques

Quantization
int5 QAT + GPTQ
bits: 5
scope: all weights
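A minimal sketch of the int5 grid with one scale per weight-matrix row, as used for the export. Assumptions: symmetric signed range [-16, 15] and round-to-nearest; the actual GPTQ pass uses Hessian-aware rounding, which is not shown here.

```python
def quantize_row_int5(row):
    """Quantize one row of float weights to int5 with a per-row scale."""
    scale = max(abs(w) for w in row) / 15.0 or 1.0  # avoid a zero scale
    q = [max(-16, min(15, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    """Map int5 codes back to floats via the stored per-row scale."""
    return [v * scale for v in q]

row = [0.3, -0.75, 0.1, 0.6]
q, scale = quantize_row_int5(row)
approx = dequantize_row(q, scale)
```

Per-row scales keep the stored metadata small: one float per row on top of 5 bits per weight.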
Architecture
BigramHash
Uses a BigramHash embedding component with 8192 buckets.
parameters: {"size":8192}
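A hypothetical sketch of the bucket lookup behind a bigram-hash embedding: the (previous, current) token pair is hashed into one of 8192 buckets, each indexing a learned embedding row. The mixing constant and hash function below are illustrative; the record does not specify them.

```python
N_BUCKETS = 8192  # from the record's parameters

def bigram_bucket(prev_id, cur_id, n_buckets=N_BUCKETS):
    """Hash a token bigram into a fixed-size bucket index."""
    return ((prev_id * 1000003) ^ cur_id) % n_buckets

bucket = bigram_bucket(17, 42)
```

Hashing trades occasional bucket collisions for a fixed parameter budget regardless of vocabulary size.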
XSA
Applies XSA in all layers.
parameters: {"layers":"all"}
MLP3.5x
Uses a widened MLP hidden dimension.
parameters: {"hidden_dim":1792,"multiplier":3.5}
LeakyReLU²
Uses a squared LeakyReLU activation.
parameters: {"negative_slope":0.5}
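A sketch of the activation with negative_slope 0.5. The record does not spell out how the negative branch is squared; the sign-preserving convention below (square the LeakyReLU output, keep its sign) is an assumption.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Squared LeakyReLU, assumed sign-preserving on the negative branch."""
    y = x if x >= 0 else negative_slope * x
    return y * abs(y)  # sign-preserving square

pos = leaky_relu_squared(2.0)   # positive branch: 2.0 squared
neg = leaky_relu_squared(-2.0)  # negative branch: (0.5 * -2.0), squared, sign kept
```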
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"ratio":"16/64"}
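A sketch of partial RoPE with ratio 16/64: rotary position encoding is applied to the first 16 of 64 head dimensions and the remaining 48 pass through unchanged. The frequency base of 10000 is the common default and is assumed here.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims dims by position-dependent angles; pass the rest through."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s      # 2D rotation of each
        out[i + 1] = x * s + y * c  # (even, odd) dimension pair
    return out

head = [1.0] * 64
rotated = partial_rope(head, pos=3)
```

Leaving most dimensions unrotated gives the model position-free channels alongside the positional ones.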
U-Net skip connections
Adds skip connections in a U-Net-like pattern.
parameters: null
SmearGate
Includes SmearGate component.
parameters: null
LN Scale
Uses layer norm scaling.
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: {"lr":0.0001,"used_for":"embeddings/scalars and TTT"}
Weight Averaging
EMA
parameters: {"decay":0.997}
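A minimal sketch of the weight EMA with decay 0.997, updated once per training step; the averaged copy, not the raw weights, is what gets exported.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]

ema = [0.0, 0.0]
for _ in range(5):
    ema = ema_update(ema, [1.0, 2.0])
```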
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
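A sketch of sliding-window evaluation with stride 64: each forward pass scores only its final 64 tokens, using the rest of the window as context, so every token is scored exactly once with near-full context. The 2048 window below is taken from the training length above and is assumed to carry over to eval.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (window_start, score_start, score_end) triples covering all tokens once."""
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        window_start = max(0, score_end - window)  # context precedes the scored span
        yield window_start, score_start, score_end
        score_start = score_end

spans = list(sliding_windows(300, window=128, stride=64))
```

A small stride costs more forward passes but gives late tokens in each window much longer context than chunked evaluation would.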
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0001,"chunk_size":131000,"epochs":3,"temperature":1,"layers":"last 2 blocks"}
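A sketch of the score-first TTT loop order implied by the parameters: each ~131k-token chunk is scored with the current weights before the model trains on it for 3 epochs, so no token is ever scored after being trained on. Model and optimizer calls are elided; only the schedule is shown.

```python
def score_first_ttt_schedule(n_tokens, chunk_size=131000, epochs=3):
    """Yield ('score', chunk) then epochs x ('train', chunk) per chunk, in order."""
    for start in range(0, n_tokens, chunk_size):
        chunk = (start, min(start + chunk_size, n_tokens))
        yield ("score", chunk)          # evaluate with current weights first
        for _ in range(epochs):
            yield ("train", chunk)      # then adapt on the chunk just scored

events = list(score_first_ttt_schedule(262000))
```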
Other
other
Post-TTT temperature calibration to correct overconfidence and improve BPB.
parameters: {"temperature":0.98}
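A sketch of the calibration step: logits are divided by T = 0.98 before the softmax, and bits for the target token are computed from the scaled distribution. The log-sum-exp is done in the numerically stable form.

```python
import math

def token_bits(logits, target, temperature=0.98):
    """Bits contributed by one token under temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -(scaled[target] - log_z) / math.log(2)

bits_raw = token_bits([2.0, 0.0], target=0, temperature=1.0)
bits_cal = token_bits([2.0, 0.0], target=0, temperature=0.98)
```

With T below 1 the distribution sharpens, so tokens the model already gets right cost slightly fewer bits.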
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3500}
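A sketch of the warmdown schedule, assuming the common convention of a constant learning rate followed by linear decay to zero over the final 3500 iterations; the total iteration count below is illustrative, not from the record.

```python
def lr_at(step, total_iters, base_lr=0.025, warmdown_iters=3500):
    """Constant LR, then linear warmdown to zero over the last warmdown_iters steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters

early = lr_at(1000, total_iters=10000)
final = lr_at(10000, total_iters=10000)
```

The base_lr of 0.025 matches the Muon lr listed above.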
Regularization
2% pruning
parameters: {"pruning_fraction":0.02}
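A sketch of the 2% pruning step, assuming a global magnitude criterion: the smallest 2% of weights by absolute value are zeroed (ties at the threshold may prune a few extra). Zeroed weights also compress better downstream.

```python
def prune_smallest(weights, fraction=0.02):
    """Zero the smallest-magnitude fraction of weights (global magnitude pruning)."""
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [(-1.0) ** i * (i + 1) / 10 for i in range(100)]
pruned = prune_smallest(weights)
```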

Novel Contributions

  • Trains a larger model (33.6M parameters) while fitting within the 16MB artifact limit via int5 quantization.
  • Full Hessian GPTQ quantization with int5 per-row export.
  • Post-TTT temperature calibration at T=0.98 to correct score-first TTT overconfidence.
  • Combines late QAT, EMA, pruning, and GPTQ to improve compression and performance.