PR #1727
open
Record: SP8192 MP-SGD TTT (4 phases) + QK-Gain 5.25 — val_bpb 1.07217 (3-seed mean)
by yahya010
val_bpb
1.0722
Architecture
Transformer
Optimizer
SGD
Artifact Size
15,938,690 B
Training Techniques
Test-Time Training
score-first TTT
parameters: {"phases":4,"enabled":true}
Optimizer
SGD
weight_decay: null
momentum: null
other_params: {"matrix_lr":0.026}
Quantization
GPTQ
bits: null
scope: model weights
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Architecture
depth recurrence
Multi-phase global SGD / phased LoRA TTT stack with repeated adaptation passes at evaluation time.
parameters: {"phases":4}
Other
other
QK-Gain initialization set to 5.25, matching the merged SOTA (PR #1493).
parameters: {"qk_gain_init":5.25}
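The summary does not say where the gain enters the attention computation. As a rough illustration only, the sketch below assumes it is a scalar that multiplies the cosine similarity between L2-normalized queries and keys, so it behaves like an inverse softmax temperature. The helper names `l2_normalize` and `attention_weights` are hypothetical and not taken from the PR; only the value 5.25 comes from the source.

```python
import math

def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v)) + eps
    return [x / norm for x in v]

def attention_weights(q, keys, gain):
    """Softmax over gain * cos(q, k); a larger gain gives sharper attention."""
    qn = l2_normalize(q)
    logits = [gain * sum(a * b for a, b in zip(qn, l2_normalize(k))) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

QK_GAIN_INIT = 5.25  # value reported in this PR; its placement here is an assumption

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]  # one aligned key, one orthogonal key
soft = attention_weights(q, keys, gain=1.0)
sharp = attention_weights(q, keys, gain=QK_GAIN_INIT)
```

Under this assumption, starting the gain at 5.25 rather than 1.0 puts noticeably more weight on the best-matching key from the first training step.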
Novel Contributions
- Extended the Multi-Phase Global SGD + Phased LoRA TTT stack from PR #1700 from 3 phases to 4 phases.
- Tuned QK_GAIN_INIT to 5.25, matching merged SOTA PR #1493.
- Demonstrated that a 4th MP-SGD phase fits within the evaluation budget.
- Reported a 3-seed mean val_bpb of 1.07217 on Track A.
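The score-first, multi-phase adaptation described above can be sketched on a toy model. This is a minimal illustration, not the PR's implementation. It assumes that "score-first" means each segment is scored with the current weights before any update on it, and that a "phase" is one such score-then-adapt stage over a contiguous slice of the evaluation stream. The context-free softmax model, the function names, and the `lr` and `steps` values are all hypothetical. The metric here is bits per token, not the per-byte `val_bpb` reported in the PR.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_bits(logits, tokens):
    """Total negative log2-likelihood of tokens under a context-free model."""
    probs = softmax(logits)
    return -sum(math.log2(probs[t]) for t in tokens)

def sgd_step(logits, tokens, lr):
    """One plain SGD step on mean cross-entropy (no momentum, no weight decay)."""
    probs = softmax(logits)
    counts = [0] * len(logits)
    for t in tokens:
        counts[t] += 1
    n = len(tokens)
    # Gradient of mean cross-entropy w.r.t. logits: probs - empirical frequency.
    return [w - lr * (p - c / n) for w, p, c in zip(logits, probs, counts)]

def score_first_ttt(logits, tokens, phases=4, lr=0.5, steps=5):
    """Score each segment with the current weights, then adapt on it."""
    seg = math.ceil(len(tokens) / phases)
    total_bits = 0.0
    for start in range(0, len(tokens), seg):
        chunk = tokens[start:start + seg]
        total_bits += score_bits(logits, chunk)  # score first: no leakage
        for _ in range(steps):                   # adapt only after scoring
            logits = sgd_step(logits, chunk, lr)
    return total_bits / len(tokens), logits

tokens = [0, 0, 1, 0] * 8  # toy stream: 32 tokens, vocabulary of 3
static_bpt = score_bits([0.0] * 3, tokens) / len(tokens)
ttt_bpt, _ = score_first_ttt([0.0] * 3, tokens, phases=4)
```

Because every token is scored before the model has trained on it, the reported loss stays a valid held-out measurement. Later segments still benefit from what was learned on earlier ones.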