PR #1758 (open)

Record: PR #1738 + PreQuant TTT LR=1e-3 + Unfrozen — val_bpb 1.02767 (3-seed mean)

by kilojoules
val_bpb: 1.0277
Architecture: Transformer
Optimizer:
Artifact Size: 16MB

Training Techniques

Test-Time Training: full TTT
parameters: {"learning_rate": 0.001, "freeze_blocks": 0, "epochs": 21, "phase": "pre-quant"}
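A minimal PyTorch sketch of the two config changes, assuming the TTT blocks are exposed as an iterable of modules (the block structure and names here are illustrative, not the repo's actual API):

```python
import torch
import torch.nn as nn

# Stand-in for the model's TTT blocks; four toy Linear layers, purely illustrative.
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])

freeze_blocks = 0  # the PR changes this from 2 (first two blocks frozen) to 0
for i, block in enumerate(blocks):
    for p in block.parameters():
        p.requires_grad_(i >= freeze_blocks)  # freeze only the first `freeze_blocks`

# Pre-quant TTT optimizer with the raised learning rate (5e-4 -> 1e-3).
trainable = [p for p in blocks.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-3)
```

With freeze_blocks=0 every parameter lands in the optimizer, which is the whole-model adaptation the "full TTT" label refers to.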
LR Schedule: cosine decay
parameters: {"epochs": 21}
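The 21-epoch cosine decay can be sketched as follows; the run's warmup and minimum-LR settings are not stated in the PR, so min_lr=0 is an assumption:

```python
import math

def cosine_lr(epoch: int, base_lr: float = 1e-3, total_epochs: int = 21,
              min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr over total_epochs.

    Illustrative schedule only: the PR records cosine decay over 21 epochs,
    but not the floor or any warmup.
    """
    t = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

Epoch 0 yields the full base_lr of 1e-3 and the rate falls monotonically toward min_lr at epoch 21.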
Quantization: GPTQ
bits: null
scope: model
Evaluation: sliding window eval
parameters: {"stride": 64}
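A sketch of how stride-64 sliding-window evaluation partitions the token stream: each window advances by the stride and scores only the tokens the previous window did not, so every token is scored exactly once with up to a full window of left context. The window length (256 here) is an assumption; the PR records only the stride:

```python
def sliding_window_spans(n_tokens: int, window: int = 256, stride: int = 64):
    """Return (ctx_start, end, n_scored) spans for sliding-window eval.

    window=256 is an illustrative choice; the PR only specifies stride=64.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only the new tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The first span scores a full window; every later span scores at most `stride` fresh tokens, and the scored counts sum to the sequence length.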

Novel Contributions

  • Increased pre-quant TTT learning rate from 5e-4 to 1e-3
  • Unfroze all TTT blocks by changing freeze_blocks from 2 to 0
  • Reported 3-seed mean val_bpb of 1.02767
  • Demonstrated that the improvement comes from a better pre-quant TTT endpoint rather than changes to quantization or main training
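For reference, the headline figure is a plain mean over three seeds; the per-seed values below are hypothetical, chosen only to illustrate how such a mean rounds to five decimal places:

```python
# Hypothetical per-seed results (NOT the run's actual numbers) averaged the
# way the reported 3-seed mean val_bpb would be.
seed_bpb = {0: 1.0271, 1: 1.0280, 2: 1.0279}
mean_bpb = sum(seed_bpb.values()) / len(seed_bpb)
print(f"val_bpb ({len(seed_bpb)}-seed mean): {mean_bpb:.5f}")
```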