val_bpb
1.0586
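For readers unfamiliar with the metric, the sketch below shows how a bits-per-byte figure such as val_bpb is conventionally derived from a summed negative log-likelihood. The function name and arguments are hypothetical and the evaluation code behind this record is not shown here, so this illustrates the standard definition only.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Hypothetical helper: total_nll_nats is the loss summed over every
    predicted token, total_bytes is the byte length of the scored text.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes per byte rather than per token.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens keeps the figure comparable across tokenizers.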
Architecture
Transformer
Optimizer
—
Artifact Size
15,983,413 bytes
Training Techniques
Test-Time Training
score-first TTT
parameters: {"scope":"K and O LoRA adapters only","mlp_lora":0}
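One way to read the scope setting above is as a filter over module names, sketched below. The module names (`k_proj`, `o_proj`, `blocks.0.mlp.fc`) and the helper `select_lora_targets` are assumptions made for illustration; the record does not state how the submission names its layers or selects them.

```python
def select_lora_targets(module_names, attn_suffixes=("k_proj", "o_proj"), mlp_lora=0):
    """Return the module names that would receive LoRA adapters.

    Hypothetical helper. attn_suffixes stands in for the "K and O" scope,
    and mlp_lora=0 mirrors the record's setting that disables MLP adapters.
    """
    targets = []
    for name in module_names:
        leaf = name.rsplit(".", 1)[-1]
        if leaf in attn_suffixes:
            targets.append(name)  # attention K or O projection
        elif mlp_lora and ".mlp." in name:
            targets.append(name)  # MLP layers only when explicitly enabled
    return targets


# Assumed layer names for a single transformer block.
names = [
    "blocks.0.attn.q_proj",
    "blocks.0.attn.k_proj",
    "blocks.0.attn.v_proj",
    "blocks.0.attn.o_proj",
    "blocks.0.mlp.fc",
    "blocks.0.mlp.proj",
]
print(select_lora_targets(names))
```

With the defaults only the K and O projections are selected, so Q, V, and the MLP weights would stay frozen during test-time training.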
Sequence Length
sequence_length
train_length: null
eval_length: 2688
Initialization
QK_GAIN_INIT
QK gain initialization set to 5.35
Regularization
weight decay
parameters: {"local_lr_mult":0.75}
Novel Contributions
- PR #1953-style long-context score-first phased TTT stack
- TTT restricted to K and O LoRA adapters
- TTT_MLP_LORA disabled
- QK_GAIN_INIT increased to 5.35
- Evaluation sequence length set to 2688
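
The score-first ordering named in the list above can be illustrated with a toy loop: each chunk is scored with the current model state before the model adapts on it, so no chunk contributes to its own score. This is a minimal sketch, not the PR #1953 implementation. `CountModel` is a hypothetical byte-frequency model standing in for the LoRA-adapted transformer, and the phased and long-context aspects are not modeled.

```python
import math
from collections import Counter


class CountModel:
    """Toy stand-in for an adaptable model: Laplace-smoothed byte frequencies."""

    def __init__(self, vocab_size=256):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll_bits(self, chunk: bytes) -> float:
        # Score with the current state only; the model is not modified.
        return sum(
            -math.log2((self.counts[b] + 1) / (self.total + self.vocab_size))
            for b in chunk
        )

    def adapt(self, chunk: bytes) -> None:
        # Stand-in for a test-time training step on an already scored chunk.
        self.counts.update(chunk)
        self.total += len(chunk)


def score_first_ttt(model, data: bytes, chunk_len: int) -> float:
    """Return bits per byte over data, scoring each chunk before adapting on it."""
    if not data:
        raise ValueError("data must be non-empty")
    total_bits = 0.0
    for start in range(0, len(data), chunk_len):
        chunk = data[start:start + chunk_len]
        total_bits += model.nll_bits(chunk)  # 1) score first
        model.adapt(chunk)                   # 2) adapt only afterwards
    return total_bits / len(data)
```

Swapping the two steps inside the loop would let the model see each chunk before being scored on it, which is the leakage that the score-first ordering avoids.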