PR #2119

open

Non-record: PR1953 K+O-only TTT + QK_GAIN_INIT=5.35

by dexhunter
val_bpb
1.0586
Architecture
Transformer
Optimizer
Artifact Size
15,983,413 bytes
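
For readers unfamiliar with the metric, the sketch below shows how a bits-per-byte figure such as the val_bpb above is conventionally computed. It assumes val_bpb means validation bits per byte, i.e. the summed negative log-likelihood converted from nats to bits and divided by the number of raw bytes scored. The function name and arguments are illustrative and are not taken from the PR.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Assumption: the loss is summed over every scored token and the byte
    count covers the same text, so tokenizer choice does not change the unit.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) turns nats into bits; dividing by bytes normalizes.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens keeps the figure comparable across runs that use different tokenizers.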

Training Techniques

Test-Time Training
score-first TTT
parameters: {"scope":"K and O LoRA adapters only","mlp_lora":0}
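
A minimal sketch of the score-first ordering, assuming the term means that each chunk of evaluation text is scored with the current weights before any adaptation step uses it. The names `score_first_ttt`, `score_fn`, and `adapt_fn` are hypothetical, and the one-parameter "model" is only a toy stand-in for the K and O LoRA adapters.

```python
from typing import Callable, Iterable, List


def score_first_ttt(
    chunks: Iterable[float],
    score_fn: Callable[[float], float],
    adapt_fn: Callable[[float], None],
) -> List[float]:
    """Score every chunk before adapting on it, and return the scores."""
    losses = []
    for chunk in chunks:
        # 1) Score with weights that have not yet seen this chunk.
        losses.append(score_fn(chunk))
        # 2) Only then take an adaptation step on the chunk just scored.
        adapt_fn(chunk)
    return losses


# Toy stand-in: a single weight `w` predicts each value; loss is squared error.
state = {"w": 0.0}


def toy_score(x: float) -> float:
    return (x - state["w"]) ** 2


def toy_adapt(x: float) -> None:
    # One gradient-like step toward the observed value (step size 0.5).
    state["w"] += 0.5 * (x - state["w"])


losses = score_first_ttt([1.0, 1.0], toy_score, toy_adapt)
```

Because the first loss is recorded before the first update, the reported score never benefits from training on the tokens being scored; only later chunks benefit from earlier adaptation.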
Sequence Length
sequence_length
train_length: null
eval_length: 2688
Initialization
QK_GAIN_INIT
QK gain initialization set to 5.35
Regularization
weight decay
parameters: {"local_lr_mult":0.75}

Novel Contributions

  • PR #1953-style long-context score-first phased TTT stack
  • TTT restricted to K and O LoRA adapters
  • TTT_MLP_LORA disabled
  • QK_GAIN_INIT increased to 5.35
  • Evaluation sequence length set to 2688
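
The adapter restriction listed above can be sketched as a simple filter over adapter names. This is an illustration under stated assumptions, not the PR's code: the dotted naming scheme (`blocks.0.attn.lora_k` and so on), the function `ttt_trainable`, and its arguments are all hypothetical. Only the K and O scope and the disabled MLP LoRA setting come from the summary.

```python
from typing import Iterable, List, Sequence


def ttt_trainable(
    adapter_names: Iterable[str],
    attn_targets: Sequence[str] = ("k", "o"),
    mlp_lora: int = 0,
) -> List[str]:
    """Return the adapter names that test-time training would update."""
    wanted = {f"lora_{t}" for t in attn_targets}
    selected = []
    for name in adapter_names:
        parts = name.split(".")
        if "mlp" in parts:
            # MLP adapters are trained only when mlp_lora is non-zero.
            if mlp_lora:
                selected.append(name)
            continue
        if parts[-1] in wanted:
            selected.append(name)
    return selected


# Hypothetical adapter names for a single transformer block.
names = [
    "blocks.0.attn.lora_q",
    "blocks.0.attn.lora_k",
    "blocks.0.attn.lora_v",
    "blocks.0.attn.lora_o",
    "blocks.0.mlp.lora_up",
]
selected = ttt_trainable(names)
```

With the defaults, only the K and O attention adapters are selected; the Q, V, and MLP adapters stay frozen during test-time training.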