PR #1947

open

Negative result: qTTT + AdamW on PR #1493 base (val_bpb 1.08902)

by phfarath
val_bpb
1.0890
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,992,907 bytes

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.001,"epochs":3}
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: null
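
For context, a minimal sketch of what a score-first TTT loop with these AdamW settings could look like. The `model` and `chunks` interfaces are hypothetical, and "score-first" is read here as scoring each chunk with the current weights before adapting on it; the submission's actual loop may differ:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=1e-3, epochs=3, weight_decay=0.01):
    """Score-first TTT sketch: each eval chunk is scored with the current
    weights *before* the model adapts on it, so the measured loss never
    sees weights already trained on the tokens being scored."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    total_loss, total_tokens = 0.0, 0
    for x, y in chunks:                      # (inputs, targets) per chunk
        # 1) score with the current weights (no grad)
        model.eval()
        with torch.no_grad():
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()
        # 2) then adapt on the same chunk for `epochs` passes
        model.train()
        for _ in range(epochs):
            opt.zero_grad(set_to_none=True)
            logits = model(x)
            F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1)).backward()
            opt.step()
    return total_loss / total_tokens         # mean NLL; convert to bpb downstream
```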
Evaluation
sliding window eval
parameters: {"stride":64}
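
A sketch of sliding-window evaluation with stride 64. The `window=256` default is illustrative, not from the submission; only the stride is given above. Each window is re-encoded and only not-yet-scored tail tokens contribute, so every target gets extra left context without double counting:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=256, stride=64):
    """Sliding-window eval sketch: overlapping windows give every target
    extra left context; each target token is counted exactly once."""
    nll, counted, scored_upto = 0.0, 0, 0
    for start in range(0, len(tokens) - 1, stride):
        end = min(start + window, len(tokens) - 1)
        x = tokens[start:end].unsqueeze(0)          # (1, T) inputs
        y = tokens[start + 1:end + 1].unsqueeze(0)  # next-token targets
        logits = model(x)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1), reduction="none")
        new = end - scored_upto                     # targets not yet counted
        nll += loss[-new:].sum().item()
        counted += new
        scored_upto = end
        if end == len(tokens) - 1:
            break
    return nll / counted                            # mean NLL per token
```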
Architecture
depth recurrence
3-layer depth recurrence / looped layers in the base stack
parameters: {"layers":3}
weight tying
shared weights across repeated virtual layers in the recurrent stack
parameters: null
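
The depth-recurrence plus weight-tying combination amounts to running one physical block several times as virtual layers. A minimal sketch, with `block` standing in for whatever sub-stack the base model actually loops:

```python
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Depth-recurrence sketch: one physical block is applied `loops`
    times, acting as `loops` virtual layers with tied weights."""
    def __init__(self, block: nn.Module, loops: int = 3):
        super().__init__()
        self.block = block        # single shared transformer block
        self.loops = loops
    def forward(self, x):
        for _ in range(self.loops):
            x = self.block(x)     # same weights reused on every pass
        return x
```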
Initialization
QK-Gain
QK-Gain initialization set to 5.5
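
"QK-Gain initialization set to 5.5" is not spelled out here; one plausible reading is a learnable gain on the scaled QK attention logits, initialized to 5.5. The module below is an assumption-laden illustration, not the submission's code:

```python
import torch
import torch.nn as nn

class GainedQKScore(nn.Module):
    """One plausible reading of 'QK-Gain init 5.5' (an assumption): a
    learnable scalar gain multiplying the scaled QK attention logits,
    initialized to 5.5."""
    def __init__(self, head_dim: int, init_gain: float = 5.5):
        super().__init__()
        self.qk_gain = nn.Parameter(torch.tensor(init_gain))
        self.scale = head_dim ** -0.5
    def forward(self, q, k):      # q, k: (B, H, T, head_dim)
        return self.qk_gain * self.scale * (q @ k.transpose(-2, -1))
```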
Regularization
weight decay
parameters: {"value":0.01}
Other
other
qTTT / query-only test-time training that adapts only c_q.weight
parameters: {"query_only":true,"parameter_scope":"c_q.weight"}
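
Since the parameter scope is given as `c_q.weight`, qTTT reduces to freezing everything else before building the test-time optimizer. A sketch (matching parameter names by suffix is an assumption about the module layout):

```python
import torch

def make_qttt_optimizer(model, lr=1e-3, weight_decay=0.01):
    """qTTT sketch: freeze everything except the query-projection
    weights (parameters named '*.c_q.weight'), then hand only those
    to AdamW so test-time updates touch nothing else."""
    qttt_params = []
    for name, p in model.named_parameters():
        if name.endswith("c_q.weight"):
            p.requires_grad_(True)
            qttt_params.append(p)
        else:
            p.requires_grad_(False)
    return torch.optim.AdamW(qttt_params, lr=lr, weight_decay=weight_decay)
```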
other
post-TTT temperature scaling applied during evaluation
parameters: {"temperature":0.98}
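
Post-TTT temperature scaling at 0.98 slightly sharpens the predictive distribution at eval time (T < 1). A minimal sketch of applying it to the logits before the loss:

```python
import torch.nn.functional as F

def temperature_scaled_nll(logits, targets, temperature=0.98):
    """Divide logits by T before the eval loss; T = 0.98 < 1 mildly
    sharpens the distribution rather than smoothing it."""
    scaled = logits / temperature
    return F.cross_entropy(scaled.view(-1, scaled.size(-1)), targets.view(-1))
```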
Compression
lzma
level: null
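
This lzma entry ties to the self-decompressing lzma+base85 packaging noted under Novel Contributions. A sketch of how such a stub can be produced (file paths and `preset=9` are illustrative; the level is listed as null above):

```python
import base64
import lzma

def pack_self_decompressing(source_path: str, out_path: str) -> None:
    """Packaging sketch: lzma-compress the source, base85-encode it, and
    emit a small stub that decompresses and exec()s itself at runtime."""
    raw = open(source_path, "rb").read()
    blob = base64.b85encode(lzma.compress(raw, preset=9)).decode()
    stub = (
        "import base64, lzma\n"
        f"exec(lzma.decompress(base64.b85decode({blob!r})))\n"
    )
    open(out_path, "w").write(stub)
```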

Novel Contributions

  • Documented negative result showing qTTT + AdamW does not beat the merged SOTA on the PR #1493 base
  • Ablation demonstrating that AdamW with lr=0.001 catastrophically degrades full TTT on the quantized depth-recurrent model
  • Ablation showing qTTT (query-only adaptation of c_q.weight) mitigates but does not fully recover the AdamW regression
  • Combined evaluation of four eval-time modifications: AdamW TTT, qTTT, post-TTT temperature scaling (0.98), and QK-Gain init 5.5
  • Submission packaged as self-decompressing lzma+base85 source to fit within the artifact size limit