PR #1727

open

Record: SP8192 MP-SGD TTT (4 phases) + QK-Gain 5.25 — val_bpb 1.07217 (3-seed mean)

val_bpb

1.0722

Architecture

Transformer

Optimizer

SGD

Artifact Size

15,938,690 B

Training Techniques

Test-Time Training

score-first TTT

parameters: {"phases":4,"enabled":true}

Optimizer

SGD

weight_decay: null

momentum: null

other_params: {"matrix_lr":0.026}

Quantization

GPTQ

bits: null

scope: model weights

Sequence Length

sequence_length

train_length: 8192

eval_length: 8192

Architecture

depth recurrence

Multi-phase global SGD / phased LoRA TTT stack with repeated adaptation passes at evaluation time.

parameters: {"phases":4}
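
The score-first idea behind this stack can be illustrated with a minimal sketch. This is not the PR's code. It assumes a toy unigram model, and it assumes a "phase" is one contiguous segment of the evaluation stream that is scored with the current weights and only then used for one SGD step. The names `score_first_ttt`, `sgd_step`, and `lr` are hypothetical, and the real phase schedule and LoRA adapters are not described here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def nll(logits, tokens):
    """Summed negative log-likelihood (nats) of tokens under a unigram model."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in tokens)

def sgd_step(logits, tokens, lr):
    """One plain SGD step on the mean cross-entropy of `tokens`."""
    probs = softmax(logits)
    grad = list(probs)  # d(mean CE)/d(logit_i) = p_i - empirical_freq_i
    for t in tokens:
        grad[t] -= 1.0 / len(tokens)
    return [w - lr * g for w, g in zip(logits, grad)]

def score_first_ttt(logits, stream, phases=4, lr=0.5):
    """Hypothetical phase schedule: score each segment, then adapt on it."""
    seg = math.ceil(len(stream) / phases)
    total_nll = 0.0
    for p in range(phases):
        chunk = stream[p * seg:(p + 1) * seg]
        if not chunk:
            break
        total_nll += nll(logits, chunk)       # score first: weights have not seen this chunk
        logits = sgd_step(logits, chunk, lr)  # adapt only after the chunk is scored
    return total_nll / len(stream), logits
```

Because every token is scored before any update that uses it, the reported loss never benefits from weights that have already seen that token. With `phases=1` the sketch reduces to ordinary evaluation, since the single update happens after all scoring is done.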

Other

other

QK-Gain initialization tuned to 5.25 to match merged SOTA.

parameters: {"qk_gain_init":5.25}
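
As a rough illustration of why the initial value of such a gain matters, the sketch below assumes that QK-Gain is a scalar multiplying the dot product of L2-normalized queries and keys. That reading is an assumption; the card only gives the initial value. The function names are hypothetical.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def attention_weights(q, keys, gain):
    """Softmax over gain * cos(q, k); the gain scales an otherwise bounded logit."""
    qn = l2_normalize(q)
    logits = [gain * sum(a * b for a, b in zip(qn, l2_normalize(k))) for k in keys]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Compare how sharply attention concentrates on the matching key.
soft = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], gain=1.0)
sharp = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], gain=5.25)
```

With normalized vectors the raw logits lie in [-1, 1], so under this assumption the gain alone sets how peaked the softmax can be at initialization.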

Extended the Multi-Phase Global SGD + Phased LoRA TTT stack of PR #1700 from 3 to 4 phases.
Tuned QK_GAIN_INIT to 5.25, matching merged SOTA PR #1493.
Demonstrated that a 4th MP-SGD phase fits within the evaluation budget.
Reported a 3-seed mean val_bpb of 1.07217 on Track A.
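
For readers unfamiliar with the metric, the sketch below shows the usual way a bits-per-byte figure is derived from a summed negative log-likelihood, and how a multi-seed mean is taken. The function names are hypothetical and the inputs are placeholders, not values from this PR.

```python
import math
from statistics import mean

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL in nats into bits per byte of the raw text."""
    return total_nll_nats / math.log(2) / total_bytes

def mean_over_seeds(per_seed_bpb):
    """Arithmetic mean of per-seed bpb values."""
    return mean(per_seed_bpb)
```

Normalizing by bytes rather than tokens makes the figure comparable across tokenizers with different vocabulary sizes.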