PR #1947

open

Negative result: qTTT + AdamW on PR #1493 base (val_bpb 1.08902)

by phfarath
val_bpb
1.0890
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,992,907 bytes

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.001,"epochs":3}
Optimizer
AdamW
weight_decay: 0.01
momentum: null
other_params: null
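
For context, a minimal sketch of what a score-first TTT loop with these AdamW settings could look like. The `model` and `chunks` interfaces are hypothetical, and "score-first" is read here as scoring each chunk with the current weights before adapting on it; the submission's actual loop may differ:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=1e-3, epochs=3, weight_decay=0.01):
    """Score-first TTT sketch: each eval chunk is scored with the current
    weights *before* the model adapts on it, so the measured loss never
    sees weights already trained on the tokens being scored."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    total_loss, total_tokens = 0.0, 0
    for x, y in chunks:                      # (inputs, targets) per chunk
        # 1) score with the current weights (no grad)
        model.eval()
        with torch.no_grad():
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()
        # 2) then adapt on the same chunk for `epochs` passes
        model.train()
        for _ in range(epochs):
            opt.zero_grad(set_to_none=True)
            logits = model(x)
            F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1)).backward()
            opt.step()
    return total_loss / total_tokens         # mean NLL; convert to bpb downstream
```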
Evaluation
sliding window eval
parameters: {"stride":64}
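
A sketch of sliding-window evaluation with stride 64. The `window=256` default is illustrative, not from the submission; only the stride is given above. Each window is re-encoded and only not-yet-scored tail tokens contribute, so every target gets extra left context without double counting:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=256, stride=64):
    """Sliding-window eval sketch: overlapping windows give every target
    extra left context; each target token is counted exactly once."""
    nll, counted, scored_upto = 0.0, 0, 0
    for start in range(0, len(tokens) - 1, stride):
        end = min(start + window, len(tokens) - 1)
        x = tokens[start:end].unsqueeze(0)          # (1, T) inputs
        y = tokens[start + 1:end + 1].unsqueeze(0)  # next-token targets
        logits = model(x)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1), reduction="none")
        new = end - scored_upto                     # targets not yet counted
        nll += loss[-new:].sum().item()
        counted += new
        scored_upto = end
        if end == len(tokens) - 1:
            break
    return nll / counted                            # mean NLL per token
```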
Architecture
depth recurrence
3-layer depth recurrence / looped layers in the base stack
parameters: {"layers":3}
weight tying
shared weights across repeated virtual layers in the recurrent stack
parameters: null
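
The depth-recurrence plus weight-tying combination amounts to running one physical block several times as virtual layers. A minimal sketch, with `block` standing in for whatever sub-stack the base model actually loops:

```python
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Depth-recurrence sketch: one physical block is applied `loops`
    times, acting as `loops` virtual layers with tied weights."""
    def __init__(self, block: nn.Module, loops: int = 3):
        super().__init__()
        self.block = block        # single shared transformer block
        self.loops = loops
    def forward(self, x):
        for _ in range(self.loops):
            x = self.block(x)     # same weights reused on every pass
        return x
```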
Initialization
QK-Gain
QK-Gain initialization set to 5.5
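
"QK-Gain initialization set to 5.5" is not spelled out here; one plausible reading is a learnable gain on the scaled QK attention logits, initialized to 5.5. The module below is an assumption-laden illustration, not the submission's code:

```python
import torch
import torch.nn as nn

class GainedQKScore(nn.Module):
    """One plausible reading of 'QK-Gain init 5.5' (an assumption): a
    learnable scalar gain multiplying the scaled QK attention logits,
    initialized to 5.5."""
    def __init__(self, head_dim: int, init_gain: float = 5.5):
        super().__init__()
        self.qk_gain = nn.Parameter(torch.tensor(init_gain))
        self.scale = head_dim ** -0.5
    def forward(self, q, k):      # q, k: (B, H, T, head_dim)
        return self.qk_gain * self.scale * (q @ k.transpose(-2, -1))
```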
Regularization
weight decay
parameters: {"value":0.01}
Other
other
qTTT / query-only test-time training that adapts only c_q.weight
parameters: {"query_only":true,"parameter_scope":"c_q.weight"}
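
Since the parameter scope is given as `c_q.weight`, qTTT reduces to freezing everything else before building the test-time optimizer. A sketch (matching parameter names by suffix is an assumption about the module layout):

```python
import torch

def make_qttt_optimizer(model, lr=1e-3, weight_decay=0.01):
    """qTTT sketch: freeze everything except the query-projection
    weights (parameters named '*.c_q.weight'), then hand only those
    to AdamW so test-time updates touch nothing else."""
    qttt_params = []
    for name, p in model.named_parameters():
        if name.endswith("c_q.weight"):
            p.requires_grad_(True)
            qttt_params.append(p)
        else:
            p.requires_grad_(False)
    return torch.optim.AdamW(qttt_params, lr=lr, weight_decay=weight_decay)
```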
other
post-TTT temperature scaling applied during evaluation
parameters: {"temperature":0.98}
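
Post-TTT temperature scaling at 0.98 slightly sharpens the predictive distribution at eval time (T < 1). A minimal sketch of applying it to the logits before the loss:

```python
import torch.nn.functional as F

def temperature_scaled_nll(logits, targets, temperature=0.98):
    """Divide logits by T before the eval loss; T = 0.98 < 1 mildly
    sharpens the distribution rather than smoothing it."""
    scaled = logits / temperature
    return F.cross_entropy(scaled.view(-1, scaled.size(-1)), targets.view(-1))
```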
Compression
lzma
level: null
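
This lzma entry ties to the self-decompressing lzma+base85 packaging noted under Novel Contributions. A sketch of how such a stub can be produced (file paths and `preset=9` are illustrative; the level is listed as null above):

```python
import base64
import lzma

def pack_self_decompressing(source_path: str, out_path: str) -> None:
    """Packaging sketch: lzma-compress the source, base85-encode it, and
    emit a small stub that decompresses and exec()s itself at runtime."""
    raw = open(source_path, "rb").read()
    blob = base64.b85encode(lzma.compress(raw, preset=9)).decode()
    stub = (
        "import base64, lzma\n"
        f"exec(lzma.decompress(base64.b85decode({blob!r})))\n"
    )
    open(out_path, "w").write(stub)
```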

Novel Contributions

  • Documented negative result showing qTTT + AdamW does not beat the merged SOTA on the PR #1493 base
  • Ablation demonstrating that AdamW with lr=0.001 catastrophically degrades full TTT on the quantized depth-recurrent model
  • Ablation showing qTTT (query-only adaptation of c_q.weight) mitigates but does not fully recover the AdamW regression
  • Combined evaluation of four eval-time modifications: AdamW TTT, qTTT, post-TTT temperature scaling (0.98), and QK-Gain init 5.5
  • Submission packaged as self-decompressing lzma+base85 source to fit within the artifact size limit