PR #1758
openRecord: PR #1738 + PreQuant TTT LR=1e-3 + Unfrozen — val_bpb 1.02767 (3-seed mean)
by kilojoules
val_bpb
1.0277
Architecture
Transformer
Optimizer
—
Artifact Size
16MB
Training Techniques
Test-Time Training
full TTT
parameters: {"learning_rate":0.001,"freeze_blocks":0,"epochs":21,"phase":"pre-quant"}
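The `learning_rate=1e-3` and `freeze_blocks=0` values above can be illustrated with a minimal sketch of block-wise freezing during test-time training. The model here is a toy stand-in (one scalar weight per block, squared-error loss); only the two parameter values come from this record.

```python
# Minimal sketch of test-time training (TTT) with selectable block freezing.
# Only learning_rate=1e-3 and freeze_blocks=0 are taken from this record;
# the model, loss, and gradient are toy stand-ins.

def ttt_adapt(weights, target, learning_rate=1e-3, freeze_blocks=0, steps=10):
    """Gradient-descent adaptation of per-block weights on a test sequence.

    weights: one float per transformer block (toy stand-in).
    Blocks with index < freeze_blocks are left untouched.
    Returns an adapted copy; the caller keeps the original weights.
    """
    adapted = list(weights)
    for _ in range(steps):
        for i in range(freeze_blocks, len(adapted)):
            grad = 2.0 * (adapted[i] - target)  # d/dw of (w - target)^2
            adapted[i] -= learning_rate * grad
    return adapted

base = [1.0, 1.0, 1.0, 1.0]

# freeze_blocks=0 (this record): every block adapts.
all_unfrozen = ttt_adapt(base, target=0.0, freeze_blocks=0)

# freeze_blocks=2 (the prior setting): the first two blocks stay fixed.
partly_frozen = ttt_adapt(base, target=0.0, freeze_blocks=2)
```

With `freeze_blocks=0`, every block receives updates; with the prior value of 2, the first two blocks never move.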
LR Schedule
cosine decay
parameters: {"epochs":21}
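A sketch of the schedule, assuming a plain cosine from the TTT peak LR down to zero. `epochs=21` and the `1e-3` peak come from this record; `min_lr=0.0` and the absence of warmup are assumptions.

```python
import math

def cosine_lr(epoch, total_epochs=21, peak_lr=1e-3, min_lr=0.0):
    """Cosine decay from peak_lr at epoch 0 to min_lr at total_epochs.

    total_epochs=21 and peak_lr=1e-3 are taken from this record;
    min_lr and the no-warmup shape are assumptions.
    """
    progress = min(epoch / total_epochs, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

start_lr = cosine_lr(0)   # full peak LR at epoch 0
end_lr = cosine_lr(21)    # decayed to ~min_lr at the final epoch
```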
Quantization
GPTQ
bits: null
scope: model
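GPTQ proper quantizes weight columns one at a time and propagates each column's rounding error into the not-yet-quantized columns using second-order (Hessian) information. As a much simpler illustration of what weight quantization does to a tensor, here is plain symmetric round-to-nearest quantization; this is not GPTQ, and the 4-bit width is an assumption (the record leaves `bits` unspecified).

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest weight quantization (illustration only).

    NOT GPTQ: GPTQ additionally compensates each column's rounding error
    using Hessian information. bits=4 is an assumption; the record leaves
    the bit width unspecified.
    """
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit symmetric
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [round(w / scale) for w in weights]  # integers in [-qmax, qmax]
    return [c * scale for c in codes], scale     # dequantized weights + scale

deq, scale = quantize_rtn([0.7, -0.35, 0.1, 0.0])
```

Round-to-nearest bounds each weight's reconstruction error by half the scale; GPTQ's error compensation is what lets it do better than this at the same bit width.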
Evaluation
sliding window eval
parameters: {"stride":64}
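`stride=64` comes from this record; the 256-token window and the scoring rule are assumptions, following the standard strided evaluation where the first window scores all its tokens and each later window scores only the tokens not yet covered, so every token is scored exactly once with maximal left context.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Return (begin, score_from, end) spans for strided evaluation.

    stride=64 matches this record; window=256 is an assumed context
    length. Each token is scored exactly once: the first window scores
    everything it sees, later windows score only the new tokens.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, prev_end, end))  # score tokens [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(300)
```

Summing the per-token negative log-likelihoods over the scored spans and dividing by `ln(2) * total_bytes` would then give val_bpb.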
Novel Contributions
- Increased pre-quant TTT learning rate from 5e-4 to 1e-3
- Unfroze all TTT blocks by changing freeze_blocks from 2 to 0
- Reported 3-seed mean val_bpb of 1.02767
- Demonstrated that the improvement comes from a better pre-quant TTT endpoint rather than changes to quantization or main training
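The first two contributions amount to a two-field change in the pre-quant TTT config. A before/after sketch; the field names and new values come from the parameters above, and the old values (LR 5e-4, freeze_blocks 2) from the contributions list:

```python
# Two-field config change described in this record's contributions.
before = {"learning_rate": 5e-4, "freeze_blocks": 2, "epochs": 21, "phase": "pre-quant"}
after  = {"learning_rate": 1e-3, "freeze_blocks": 0, "epochs": 21, "phase": "pre-quant"}

changed = {k for k in before if before[k] != after[k]}
```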