PR #1735 (open)

Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean)

by AjAnubolu

val_bpb: 1.0429
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.99 MB

Training Techniques

Test-Time Training: full TTT
  parameters: {"enabled":true,"epochs":21,"parallel_gpus":8,"pre_quant":true}
Optimizer: AdamW
  weight_decay: 0
  momentum: null
  other_params: {"pre_quant_ttt":true}
LR Schedule: cosine decay
  parameters: {"scope":"epoch-level","t_max":21,"eta_min_ratio":0.1}
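The epoch-level cosine decay with t_max=21 and eta_min_ratio=0.1 can be sketched in a few lines; this is a minimal stand-alone illustration of that schedule (the base learning rate is an assumed placeholder, not a value from the submission):

```python
import math

def cosine_lr(epoch: int, base_lr: float, t_max: int = 21, eta_min_ratio: float = 0.1) -> float:
    """Epoch-level cosine decay: base_lr at epoch 0, base_lr * eta_min_ratio at epoch t_max."""
    eta_min = base_lr * eta_min_ratio
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# One LR per TTT epoch; the schedule steps once per epoch, not per batch.
lrs = [cosine_lr(e, base_lr=1e-4) for e in range(22)]
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=21, eta_min=0.1 * base_lr)` with `scheduler.step()` called once per TTT epoch.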
Quantization: GPTQ
  bits: 6
  scope: full model
Compression: lzma
  level: null
Evaluation: sliding window eval
  parameters: {"stride":64,"single_pass":true}
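For sliding-window evaluation with stride 64 in single-pass mode, each token is scored exactly once even though consecutive windows overlap. A minimal sketch of the window bookkeeping (the context length of 1024 is an assumption for illustration; only the stride is stated in the submission):

```python
def sliding_windows(n_tokens: int, window: int = 1024, stride: int = 64):
    """Return (begin, end, n_scored) spans for single-pass sliding-window eval.

    Windows advance by `stride`; only the tokens not covered by a previous
    window are scored, so every token contributes to the loss exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_scored = end - prev_end  # only the newly exposed tokens
        if n_scored <= 0:
            break
        spans.append((begin, end, n_scored))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The first window scores all of its tokens; each later window scores only its last `stride` tokens, with the rest serving as context.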
Weight Averaging: EMA
  parameters: null
Architecture:
  depth recurrence: 3-layer depth recurrence with virtual layers
    parameters: {"layers":3,"virtual_layers":17}
  weight tying: tied embeddings
    parameters: null

Novel Contributions

  • 8-GPU parallel pre-quant AdamW TTT using federated averaging across ranks
  • Epoch-level cosine learning-rate schedule across 21 TTT epochs
  • torch.compile acceleration for TTT forward pass
  • Fixed predictor with no eval-time adaptation, SLOT, or n-gram cache
  • GPTQ int6 artifact under the 16MB limit
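The federated-averaging step in the first contribution can be illustrated with a toy sketch: each of the 8 ranks adapts its own copy of the weights during TTT, and the copies are then averaged element-wise. Plain Python lists stand in for parameter tensors here; a real multi-GPU implementation would instead all-reduce each tensor with `torch.distributed.all_reduce` and divide by the world size.

```python
def federated_average(rank_weights):
    """Element-wise mean of per-rank weight vectors (federated averaging)."""
    n_ranks = len(rank_weights)
    n_params = len(rank_weights[0])
    return [sum(w[i] for w in rank_weights) / n_ranks for i in range(n_params)]

# Toy example: 8 "ranks" each hold a slightly different adapted weight vector.
ranks = [[0.0 + r, 1.0 + r] for r in range(8)]
avg = federated_average(ranks)  # -> [3.5, 4.5]
```

Because the averaging happens before GPTQ quantization ("pre-quant"), the merged weights are what gets quantized into the final int6 artifact.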