PR #1727

open

Record: SP8192 MP-SGD TTT (4 phases) + QK-Gain 5.25 — val_bpb 1.07217 (3-seed mean)

by yahya010
val_bpb
1.0722
Architecture
Transformer
Optimizer
SGD
Artifact Size
15,938,690 B

Training Techniques

Test-Time Training
score-first TTT
parameters: {"phases":4,"enabled":true}
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.026}
Quantization
GPTQ
bits: null
scope: model weights
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Architecture
depth recurrence
Multi-Phase Global SGD (MP-SGD) combined with a phased LoRA TTT stack, running repeated adaptation passes at evaluation time.
parameters: {"phases":4}
Other
other
QK-Gain initialization set to 5.25, matching merged SOTA PR #1493.
parameters: {"qk_gain_init":5.25}
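
The score-first TTT entry above, combined with "repeated adaptation passes at evaluation time", can be illustrated with a toy sketch. This is not the PR's code: it swaps the transformer for a context-free softmax over a tiny vocabulary, and the names `score_first_ttt`, `sgd_pass`, `chunk_nll_bits` and the learning rate `0.5` are hypothetical choices for illustration. The only points it is meant to show are that each chunk is scored before the model adapts on it, and that adaptation runs `phases` plain-SGD passes (no momentum, no weight decay).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_nll_bits(logits, chunk):
    """Total negative log2-likelihood of a chunk under the toy model."""
    probs = softmax(logits)
    return -sum(math.log2(probs[t]) for t in chunk)


def sgd_pass(logits, chunk, lr):
    """One plain SGD pass over the chunk, updating logits in place."""
    for t in chunk:
        probs = softmax(logits)
        for i in range(len(logits)):
            # Gradient of cross-entropy w.r.t. logit i is p_i - 1[i == t].
            grad = probs[i] - (1.0 if i == t else 0.0)
            logits[i] -= lr * grad


def score_first_ttt(chunks, vocab_size, phases=4, lr=0.5):
    """Return mean bits per token, scoring every chunk before adapting on it.

    `phases` and `lr` are illustrative assumptions, not the PR's settings
    for its real model.
    """
    logits = [0.0] * vocab_size  # toy model starts uniform
    total_bits = 0.0
    total_tokens = 0
    for chunk in chunks:
        # 1) Score first: the chunk is evaluated with weights that have
        #    never seen it.
        total_bits += chunk_nll_bits(logits, chunk)
        total_tokens += len(chunk)
        # 2) Adapt afterwards: repeated passes over the already-scored chunk.
        for _ in range(phases):
            sgd_pass(logits, chunk, lr)
    return total_bits / total_tokens
```

Because scoring always precedes the update, a single-chunk evaluation gives the same score regardless of `phases`; adaptation can only help on later chunks.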

Novel Contributions

  • Extended PR #1700's Multi-Phase Global SGD + Phased LoRA TTT stack from 3 phases to 4.
  • Tuned QK_GAIN_INIT to 5.25, matching merged SOTA PR #1493.
  • Demonstrated that a 4th MP-SGD phase fits within the evaluation budget.
  • Reported a 3-seed mean val_bpb of 1.07217 on Track A.
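
The headline number is a mean over three seeds; the per-seed values are not listed on this page. As a minimal sketch of how such a figure is typically assembled, assuming `val_bpb` denotes validation bits per byte computed from a summed negative log-likelihood in nats, the helpers below (hypothetical names `bits_per_byte` and `seed_mean`) show the arithmetic. No real per-seed results are used.

```python
import math
from statistics import mean


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL in nats into bits per byte of evaluated text."""
    return total_nll_nats / (math.log(2) * total_bytes)


def seed_mean(per_seed_bpb):
    """Arithmetic mean of per-seed val_bpb values."""
    return mean(per_seed_bpb)
```

With three per-seed values `a`, `b`, `c`, the reported figure would be `seed_mean([a, b, c])`, rounded to five decimal places as in the title.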