PR #1129 (open)

Record: CROWN-Q + GPTQ + Legal TTT — val_bpb 1.1174 (3-seed mean)

by EthanYangTW
val_bpb: 1.1174
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,961,751 bytes

Training Techniques

Architecture
GQA
Grouped query attention
parameters: {"layers":11,"kv_heads":4,"query_heads":8}
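With 8 query heads over 4 KV heads, each pair of query heads shares one KV head. A minimal sketch of that mapping (the function name is illustrative, not from the PR):

```python
# Grouped-query attention head mapping for this config:
# 8 query heads share 4 KV heads, i.e. 2 query heads per KV head.

def kv_head_for(query_head: int, n_query_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Return the KV head index that a given query head attends with."""
    group_size = n_query_heads // n_kv_heads  # 2 query heads per KV head
    return query_head // group_size

print([kv_head_for(h) for h in range(8)])  # → [0, 0, 1, 1, 2, 2, 3, 3]
```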
XSA
XSA applied to all layers
parameters: {"layers":11}
BigramHash
Bigram hash embedding
parameters: {"dimensions":2048}
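A hedged sketch of a bigram hash embedding lookup: the (previous, current) token pair is hashed into a fixed-size table whose row is added to the ordinary token embedding. Whether 2048 is the table size or the embedding width isn't stated; this sketch assumes table size.

```python
import hashlib

TABLE_SIZE = 2048  # from {"dimensions": 2048}; assumed here to be the table size

def bigram_slot(prev_id: int, cur_id: int) -> int:
    """Deterministically hash a (previous, current) token-id pair into the table."""
    key = f"{prev_id}:{cur_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % TABLE_SIZE

# The embedding row at this slot would be added to the current token's embedding.
slot = bigram_slot(17, 42)
assert 0 <= slot < TABLE_SIZE
```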
Partial RoPE
Partial rotary positional embeddings
parameters: {"train_fraction":16,"total_fraction":64}
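Partial RoPE rotates only a leading slice of each head's dimensions and leaves the rest position-independent. Reading {"train_fraction":16,"total_fraction":64} as "rotate 16 of 64 head dims" (an assumption), a sketch:

```python
import math

HEAD_DIM, ROPE_DIMS = 64, 16  # assumed reading of train_fraction / total_fraction

def partial_rope(x, pos, base=10000.0):
    """Apply rotary embedding to the first ROPE_DIMS dims of one head vector;
    dims ROPE_DIMS..HEAD_DIM pass through unchanged."""
    out = list(x)
    for i in range(0, ROPE_DIMS, 2):
        theta = pos / (base ** (i / ROPE_DIMS))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```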
SmearGate
SmearGate activation/gating component
parameters: null
OrthoInit
Orthogonal initialization
parameters: null
ReLU²
Squared ReLU MLP activation
parameters: null
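For reference, squared ReLU just squares the positive part of the pre-activation:

```python
def relu2(x: float) -> float:
    """Squared ReLU, the MLP activation listed above: max(x, 0) ** 2."""
    return max(x, 0.0) ** 2

print(relu2(-3.0), relu2(2.0))  # → 0.0 4.0
```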
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start":"last ~150 steps"}
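Both averaging schemes maintain a shadow copy of the weights: EMA updates it every step with decay 0.997, while SWA uniformly averages checkpoints from roughly the last 150 steps. A pure-Python sketch of the two updates on a flat weight vector:

```python
EMA_DECAY = 0.997  # from the EMA parameters above

def ema_update(shadow, weights, decay=EMA_DECAY):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]

def swa_average(checkpoints):
    """Uniform average of the checkpoints collected during the SWA window."""
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]
```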
Quantization
late QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
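Both quantization stages target the same 6-bit grid: late QAT fake-quantizes weights in the forward pass near the end of training, and GPTQ then rounds the final weights with second-order error compensation. A sketch of the shared 6-bit round-to-grid step (GPTQ's Cholesky-based compensation is omitted):

```python
BITS = 6
QMAX = 2 ** (BITS - 1) - 1  # 31: symmetric signed 6-bit range is [-32, 31]

def fake_quant(w: float, scale: float) -> float:
    """Round a weight to the nearest representable 6-bit value, then dequantize."""
    q = max(-QMAX - 1, min(QMAX, round(w / scale)))
    return q * scale
```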
Regularization
LN scale
parameters: null
CROWN-Q
parameters: {"description":"Curvature-weighted quantization variance penalty during warmdown"}
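CROWN-Q is described only by the one-line summary above, so the following is a speculative sketch of what a curvature-weighted quantization variance penalty could look like: the squared quantization error of each weight, weighted by a per-weight curvature estimate (e.g. a diagonal Hessian or Fisher approximation), added to the loss during warmdown. The PR's actual formulation may differ.

```python
import math

def crownq_penalty(weights, curvatures, quantize):
    """Speculative sketch: sum_i h_i * (w_i - Q(w_i))^2, so high-curvature
    weights are pushed toward representable quantization points."""
    return sum(h * (w - quantize(w)) ** 2 for w, h in zip(weights, curvatures))

# Toy example with integer-grid rounding standing in for Q:
penalty = crownq_penalty([0.4, 1.1], [2.0, 1.0], quantize=round)
assert math.isclose(penalty, 2.0 * 0.4 ** 2 + 1.0 * 0.1 ** 2, rel_tol=1e-6)
```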
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.025}
LR Schedule
warmdown
parameters: {"shape":"sqrt","description":"holds learning rate higher longer during warmdown"}
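The sqrt shape decays the LR proportionally to the square root of the remaining warmdown fraction, which sits above a linear ramp for the whole cooldown. A sketch, with the peak LR taken from the Muon section above:

```python
import math

PEAK_LR = 0.025  # Muon learning rate from the optimizer section above

def warmdown_lr(step, warmdown_start, total_steps, peak=PEAK_LR):
    """Hold peak LR, then decay as sqrt of the remaining warmdown fraction --
    higher for longer than a linear ramp, reaching 0 at total_steps."""
    if step < warmdown_start:
        return peak
    frac = (total_steps - step) / (total_steps - warmdown_start)
    return peak * math.sqrt(max(frac, 0.0))
```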
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.0001,"epochs":3,"unfrozen_blocks":2}
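The "score-first" ordering is what keeps TTT legal for a compression-style benchmark: each chunk contributes to the reported loss before any gradient step has seen it. A sketch of the loop, where score_fn and update_fn stand in for the real forward pass and the 3-epoch update of the 2 unfrozen blocks:

```python
def score_first_ttt(chunks, score_fn, update_fn, epochs=3):
    """Score each chunk with the current weights, then adapt on it.
    No token's score ever benefits from a gradient step that used it."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(chunk)  # 1) score before any update on this chunk
        for _ in range(epochs):        # 2) only then take the TTT gradient steps
            update_fn(chunk)
    return total_loss
```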
Evaluation
sliding window eval
parameters: {"stride":32}
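With a stride of 32, each evaluation step scores 32 new tokens while giving them as much left context as the window allows. A sketch of the window/scoring indices under that reading (context_len is illustrative):

```python
STRIDE = 32  # from the eval parameters above

def sliding_windows(n_tokens, context_len, stride=STRIDE):
    """Yield (start, end, first_scored): each window scores tokens
    [first_scored, end) using context from [start, end)."""
    for end in range(stride, n_tokens + 1, stride):
        yield max(0, end - context_len), end, end - stride

print(list(sliding_windows(64, 48)))  # → [(0, 32, 0), (16, 64, 32)]
```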
Compression
zstd
level: 22

Novel Contributions

  • CROWN-Q curvature-weighted quantization penalty during warmdown
  • Full Cholesky GPTQ with act-order calibrated on training data only
  • Score-first legal TTT where each token is scored before any gradient update
  • Sqrt cooldown schedule that keeps learning rate higher during warmdown
  • Combined post-quantization TTT pipeline achieving 1.1174 val_bpb