PR #1351

open

Record: Discriminative TTT — val_bpb 1.0807 (3-seed mean)

by resouer
val_bpb: 1.0807
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.8 MB

Training Techniques

Test-Time Training: full TTT
  parameters: {"timing":"pre-quantization","epochs":10,"freeze":0,"per_block_lr_scaling":{"start":0.3,"end":1,"interpolation":"linear"}}
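The per-block LR scaling above can be sketched as a linear interpolation of an LR multiplier across transformer blocks, from 0.3x at the first block to 1.0x at the last, applied via per-block optimizer parameter groups. The block count and base LR below are illustrative assumptions, not values from this PR.

```python
# Sketch of per-block LR scaling for test-time training (TTT).
# The config specifies linear interpolation from 0.3x (first block)
# to 1.0x (last block); block count and base LR are illustrative.

def per_block_lr_scales(n_blocks, start=0.3, end=1.0):
    """Linearly interpolate an LR multiplier for each transformer block."""
    if n_blocks == 1:
        return [end]
    step = (end - start) / (n_blocks - 1)
    return [start + i * step for i in range(n_blocks)]

base_lr = 1e-4                    # illustrative base learning rate
scales = per_block_lr_scales(12)  # e.g. a 12-block model
param_groups = [
    {"params": [], "lr": base_lr * s}  # each block's params go in its group
    for s in scales
]
```

With freeze=0, every block gets a parameter group, so all blocks are adapted during TTT.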
Optimizer: AdamW
  weight_decay: null
  momentum: null
  other_params: {"phase":"pre-quant TTT"}
Optimizer: Muon
  weight_decay: null
  momentum: null
  other_params: {"parallel":true}
Quantization: GPTQ
  bits: 6
  scope: all
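For intuition about what "bits: 6" buys, here is a minimal round-to-nearest quantization sketch showing the 6-bit symmetric grid (integer levels in [-31, 31]). Note this is only the grid: GPTQ itself additionally uses Hessian-weighted error compensation when rounding, which this sketch does not implement.

```python
# Illustrative symmetric round-to-nearest quantization of a weight row.
# Shows the 6-bit level grid implied by "bits: 6"; GPTQ's actual
# Hessian-based rounding is NOT implemented here.

def quantize_rtn(weights, bits=6):
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    dq = [qi * scale for qi in q]       # dequantized approximation
    return q, dq, scale

q, dq, scale = quantize_rtn([0.5, -1.0, 0.25])
```

At 6 bits the grid has 63 usable levels per scale group, which is why the ~15.8 MB artifact can stay small while keeping val_bpb competitive.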
Architecture: BigramHash
  Base model uses BigramHash 2048x128
  parameters: {"dimensions":2048,"width":128}
Architecture: XSA
  Base model uses the XSA-all attention variant
  parameters: null
Architecture: GQA
  Base model uses grouped query attention
  parameters: null
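A "BigramHash 2048x128" component can be read as a hashed bigram embedding: each adjacent token pair is hashed into one of 2048 buckets, each associated with a 128-dim vector. The hash function and mixing constant below are illustrative assumptions, not this PR's code.

```python
# Hypothetical sketch of a hashed bigram embedding ("BigramHash 2048x128"):
# each (prev_token, token) pair maps to one of 2048 buckets, and each
# bucket owns a 128-dim vector that augments the input embedding.
# The hash below is an assumption; the real implementation may differ.

N_BUCKETS, WIDTH = 2048, 128

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Multiplicative mixing keeps nearby token ids in different buckets.
    return ((prev_tok * 1000003) ^ tok) % N_BUCKETS

# Bucket table: N_BUCKETS rows of WIDTH floats (zeros here for brevity).
table = [[0.0] * WIDTH for _ in range(N_BUCKETS)]
vec = table[bigram_bucket(17, 42)]  # 128-dim vector for the pair (17, 42)
```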
LR Schedule: warmdown
  parameters: {"warmdown_steps":4000}
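A warmdown schedule holds the LR constant and then decays it linearly to zero over the final warmdown_steps (4000 per the config). The total step count and base LR below are illustrative assumptions.

```python
# Sketch of a "warmdown" LR schedule: constant LR, then linear decay to
# zero over the last warmdown_steps. Total steps and base LR are
# illustrative, not taken from this PR.

def warmdown_lr(step, total_steps, warmdown_steps=4000, base_lr=1e-3):
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr                       # constant phase
    frac = (total_steps - step) / warmdown_steps
    return base_lr * max(frac, 0.0)          # linear warmdown to zero
```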
Evaluation: sliding window eval
  parameters: {"stride":64}
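Sliding-window evaluation with stride 64 typically means each window only scores its final `stride` tokens, so every token is scored exactly once with as much left context as the window allows. A sketch of the index bookkeeping, with an assumed window size:

```python
# Sketch of sliding-window eval indexing with stride 64: each window
# contributes loss only for its last `stride` positions, so tokens are
# scored once each with maximal left context. Window size is an
# illustrative assumption.

def sliding_window_positions(n_tokens, window=1024, stride=64):
    """Yield (window_start, score_start, score_end) index triples."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        yield start, pos, end
        pos = end

triples = list(sliding_window_positions(200, window=128, stride=64))
```

Summing the loss over each `[score_start, score_end)` range and dividing by total bytes gives the reported val_bpb.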

Novel Contributions

  • Discriminative test-time training with per-block adaptive learning rates
  • Linear LR interpolation across transformer blocks from 0.3x to 1.0x
  • freeze=0, i.e. all blocks adapted, during pre-quant TTT
  • Coprime-stride multi-shard data loader with weighted random shard sampling
  • Improved configuration using QK_GAIN=5.0, WARMDOWN=4000, and GPTQ damp=0.005
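The coprime-stride loader contribution can be sketched as follows: walking each shard with a stride coprime to its length visits every index exactly once before repeating, while the shard itself is picked by weighted random sampling. Shard sizes, weights, and the stride-selection rule below are illustrative assumptions.

```python
# Hypothetical sketch of a coprime-stride multi-shard sampler: a stride
# coprime to the shard length makes i -> (i + stride) % n a full cycle,
# so each shard is covered without repeats; shards are chosen by
# weighted random sampling. All concrete values are illustrative.
import math
import random

def coprime_stride(n):
    """Pick a stride coprime to n so the modular walk visits all n indices."""
    s = n // 2 + 1
    while math.gcd(s, n) != 1:
        s += 1
    return s

def shard_sampler(shard_sizes, weights, num_samples, seed=0):
    rng = random.Random(seed)
    strides = [coprime_stride(n) for n in shard_sizes]
    cursors = [0] * len(shard_sizes)
    for _ in range(num_samples):
        shard = rng.choices(range(len(shard_sizes)), weights=weights)[0]
        idx = cursors[shard]
        cursors[shard] = (idx + strides[shard]) % shard_sizes[shard]
        yield shard, idx
```

Because the stride is coprime to the shard length, no example is revisited until the whole shard has been seen, even though shard choice is random.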