PR #1087
open
Record: 1.1407 BPB - LeakyReLU^2 + Delayed QAT + Score-First TTT
by Dhenenjay
val_bpb
1.1407
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.7 MB
Training Techniques
Quantization
STE QAT
bits: 5
scope: MLP
STE QAT
bits: 6
scope: attention
GPTQ-lite
bits: null
scope: all
late QAT
bits: 5
scope: MLP
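The quantization entries above can be illustrated with a minimal fake-quantization sketch. It assumes symmetric per-row scaling to a signed integer grid (5 bits for MLP weights, 6 bits for attention weights) and a straight-through estimator that passes gradients through the rounding step unchanged. The function names, the per-row scaling, and the exact handling of the 5500-step delay are assumptions for illustration, not the PR's code.

```python
def fake_quantize(row, bits):
    """Round a row of weights to a symmetric signed grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1            # 15 for 5 bits, 31 for 6 bits
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def ste_grad(grad_output):
    """Straight-through estimator: treat rounding as identity in the backward pass."""
    return list(grad_output)


def qat_forward(row, bits, step, qat_start_step=5500):
    """Delayed QAT: use full-precision weights until qat_start_step (assumed boundary)."""
    if step < qat_start_step:
        return list(row)
    return fake_quantize(row, bits)
```

Before the start step the weights pass through untouched; afterwards each weight moves by at most half a quantization step.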
Architecture
LeakyReLU
Uses LeakyReLU(0.5)^2 in the MLP.
parameters: {"slope":0.5,"squared":true}
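As a small illustration of the activation named above, the sketch below applies a leaky ReLU with negative slope 0.5 and then squares the result. The scalar, pure-Python form and the function name are chosen for readability and are not taken from the PR.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y
```

Because of the final square, the output is non-negative for every input; negative inputs still contribute, scaled by slope squared.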
Partial RoPE
Applies RoPE to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
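A minimal sketch of partial RoPE for a single head vector follows: the first 16 of 64 dimensions are rotated by position-dependent angles and the remaining 48 pass through unchanged. The half-split pairing of dimensions and the frequency base of 10000 are common conventions assumed here; the PR may use a different layout.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector; keep the rest."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # Angle for pair i; the base value is an assumed convention
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Each pair undergoes a plain 2D rotation, so the norm of the rotated slice is preserved.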
BigramHash
Uses a bigram hash embedding component.
parameters: {"vocab_size":6144}
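One way to picture a bigram hash embedding is to map each (previous token, current token) pair to one of 6144 buckets and look up an extra embedding row for that bucket. The sketch below only computes the bucket ids. The hash function, the multiplier, the start-of-sequence placeholder, and the reading of `vocab_size` as the bucket count are all assumptions.

```python
def bigram_bucket(prev_token, token, num_buckets=6144):
    """Map a token pair to a bucket id (hypothetical multiplicative hash)."""
    return (prev_token * 1000003 + token) % num_buckets


def bigram_ids(tokens, num_buckets=6144, bos=0):
    """Bucket id per position, pairing each token with its predecessor."""
    ids = []
    prev = bos  # assumed placeholder for the position before the first token
    for t in tokens:
        ids.append(bigram_bucket(prev, t, num_buckets))
        prev = t
    return ids
```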
SmearGate
Includes SmearGate in the architecture.
parameters: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(l+1)"}
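The scale formula above depends only on the layer index l. A minimal sketch, assuming l is zero-based and that the factor multiplies the normalized activations of that layer (where exactly it is applied is not stated in the summary):

```python
import math


def ln_scale(layer_index):
    """Per-layer scale 1/sqrt(l+1), with l counted from zero (assumed)."""
    return 1.0 / math.sqrt(layer_index + 1)
```

Under this reading the first layer is unscaled and deeper layers are damped progressively.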
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
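The two averaging schemes can be sketched side by side: an exponential moving average updated every step with decay 0.997, and a uniform running average that takes a snapshot every 50 steps. How the PR combines the two results is not stated in the summary, so the sketch keeps them separate; the names are illustrative.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: keep `decay` of the old average, blend in the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SWA:
    """Uniform running average of weight snapshots taken every `every` steps."""

    def __init__(self):
        self.avg = None
        self.count = 0

    def maybe_update(self, step, weights, every=50):
        if step % every != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean over all snapshots seen so far
            self.avg = [a + (w - a) / self.count for a, w in zip(self.avg, weights)]
```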
Optimizer
SGD
weight_decay: null
momentum: 0.95
other_params: {"learning_rate":0.005}
AdamW
weight_decay: null
momentum: null
other_params: {"learning_rate":0.035}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":16384,"freeze_blocks":0}
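The "score-first" ordering can be sketched as a loop over fixed-size chunks: each chunk is scored with weights that have not yet been trained on it, and only then is the model adapted on that chunk for a few epochs. `score_fn` and `adapt_fn` are hypothetical callables standing in for the model's loss evaluation and its training step; they are not names from the PR.

```python
def score_first_ttt(tokens, score_fn, adapt_fn, chunk_tokens=16384, epochs=3):
    """Score each chunk before adapting on it, so no token is scored after being trained on."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        total_loss += score_fn(chunk)   # evaluate first
        for _ in range(epochs):
            adapt_fn(chunk)             # then train on the chunk just scored
    return total_loss
```

With `freeze_blocks` set to 0, the adaptation step would presumably update every block, though the summary does not spell this out.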
Evaluation
sliding window eval
parameters: {"stride":64}
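A sliding-window evaluation with stride 64 can be sketched as a window schedule: each window advances by the stride, and only the tokens not already scored by an earlier window count toward the metric, so later tokens are scored with nearly a full context. The context length is not given in the summary, so `context_len` is a hypothetical parameter.

```python
def sliding_windows(num_tokens, context_len, stride=64):
    """Return (window_start, window_end, score_from) triples.

    Tokens in [score_from, window_end) are scored in that window; earlier
    tokens in the window serve only as context.
    """
    assert 0 < stride <= context_len
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + context_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows
```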
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
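A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The sketch below assumes a linear decay to zero over the last 3500 steps; the decay shape and the `total_steps` argument are assumptions, since the summary gives only the warmdown length.

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    """Multiplier on the base learning rate: 1.0, then linear decay to 0.0."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return max(0.0, steps_left / warmdown_steps)
```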
Novel Contributions
- Delayed QAT with quantization noise injected only after step 5500
- Score-first TTT with strong post-training gains
- LeakyReLU(0.5)^2 combined with Partial RoPE and LN Scale
- GPTQ-lite per-row optimal clip percentile search
- EMA and SWA combined with delayed quantization-aware training
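The per-row clip percentile search mentioned in the list above can be sketched as follows: for each weight row, try several clipping thresholds taken from percentiles of the absolute values, quantize with each, and keep the threshold with the lowest squared reconstruction error. The candidate percentiles, the error metric, and the function names are assumptions, and the sketch covers only the clip search, not any other part of GPTQ.

```python
import math


def percentile_abs(row, pct):
    """Nearest-rank percentile of the absolute values in a row."""
    mags = sorted(abs(w) for w in row)
    rank = max(1, math.ceil(pct / 100.0 * len(mags)))
    return mags[rank - 1]


def quantize_row(row, bits, clip):
    """Symmetric quantization of a row with values clipped to [-clip, clip]."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def best_clip_for_row(row, bits, percentiles=(100.0, 99.9, 99.0, 95.0, 90.0)):
    """Return (clip, quantized_row, squared_error) for the best candidate clip."""
    best = None
    for pct in percentiles:
        clip = percentile_abs(row, pct)
        if clip == 0.0:
            continue  # a zero clip would give a zero scale
        q = quantize_row(row, bits, clip)
        err = sum((a - b) ** 2 for a, b in zip(row, q))
        if best is None or err < best[2]:
            best = (clip, q, err)
    if best is None:  # all-zero row: nothing to quantize
        return 0.0, list(row), 0.0
    return best
```

Because the 100th percentile (the row maximum) is among the candidates, the selected clip is never worse than plain max-abs scaling under this error metric.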