PR #1758 (open)

Record: PR #1738 + PreQuant TTT LR=1e-3 + Unfrozen — val_bpb 1.02767 (3-seed mean)

by kilojoules
val_bpb: 1.0277
Architecture: Transformer
Optimizer:
Artifact Size: 16MB

Training Techniques

Test-Time Training: full TTT
parameters: {"learning_rate": 0.001, "freeze_blocks": 0, "epochs": 21, "phase": "pre-quant"}
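A minimal PyTorch sketch of the two config changes, assuming the TTT blocks are exposed as an iterable of modules (the block structure and names here are illustrative, not the repo's actual API):

```python
import torch
import torch.nn as nn

# Stand-in for the model's TTT blocks; four toy Linear layers, purely illustrative.
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])

freeze_blocks = 0  # the PR changes this from 2 (first two blocks frozen) to 0
for i, block in enumerate(blocks):
    for p in block.parameters():
        p.requires_grad_(i >= freeze_blocks)  # freeze only the first `freeze_blocks`

# Pre-quant TTT optimizer with the raised learning rate (5e-4 -> 1e-3).
trainable = [p for p in blocks.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-3)
```

With freeze_blocks=0 every parameter lands in the optimizer, which is the whole-model adaptation the "full TTT" label refers to.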
LR Schedule: cosine decay
parameters: {"epochs": 21}
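The 21-epoch cosine decay can be sketched as follows; the run's warmup and minimum-LR settings are not stated in the PR, so min_lr=0 is an assumption:

```python
import math

def cosine_lr(epoch: int, base_lr: float = 1e-3, total_epochs: int = 21,
              min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr over total_epochs.

    Illustrative schedule only: the PR records cosine decay over 21 epochs,
    but not the floor or any warmup.
    """
    t = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

Epoch 0 yields the full base_lr of 1e-3 and the rate falls monotonically toward min_lr at epoch 21.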
Quantization: GPTQ
bits: null
scope: model
Evaluation: sliding window eval
parameters: {"stride": 64}
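A sketch of how stride-64 sliding-window evaluation partitions the token stream: each window advances by the stride and scores only the tokens the previous window did not, so every token is scored exactly once with up to a full window of left context. The window length (256 here) is an assumption; the PR records only the stride:

```python
def sliding_window_spans(n_tokens: int, window: int = 256, stride: int = 64):
    """Return (ctx_start, end, n_scored) spans for sliding-window eval.

    window=256 is an illustrative choice; the PR only specifies stride=64.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only the new tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The first span scores a full window; every later span scores at most `stride` fresh tokens, and the scored counts sum to the sequence length.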

Novel Contributions

  • Increased pre-quant TTT learning rate from 5e-4 to 1e-3
  • Unfroze all TTT blocks by changing freeze_blocks from 2 to 0
  • Reported 3-seed mean val_bpb of 1.02767
  • Demonstrated that the improvement comes from a better pre-quant TTT endpoint rather than changes to quantization or main training
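For reference, the headline figure is a plain mean over three seeds; the per-seed values below are hypothetical, chosen only to illustrate how such a mean rounds to five decimal places:

```python
# Hypothetical per-seed results (NOT the run's actual numbers) averaged the
# way the reported 3-seed mean val_bpb would be.
seed_bpb = {0: 1.0271, 1: 1.0280, 2: 1.0279}
mean_bpb = sum(seed_bpb.values()) / len(seed_bpb)
print(f"val_bpb ({len(seed_bpb)}-seed mean): {mean_bpb:.5f}")
```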