val_bpb
1.0811
Architecture
Transformer
Optimizer
—
Artifact Size
under 16 MB (16,000,000-byte cap)
Training Techniques
Sequence Length
sequence_length
train_length: 8192
eval_length: null
Test-Time Training
TTT
parameters: {"learning_rate":0.006,"epochs":3}
Other
other
QK gain initialization set to 5.25
parameters: {"qk_gain_init":5.25}
Novel Contributions
- 3-seed submission with artifacts under 16 MB
- Test-time training (TTT) enabled, with learning rate 0.006 for 3 epochs
- QK gain initialization of 5.25
- All artifacts kept under the 16,000,000 byte cap
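
The byte cap in the last item can be verified mechanically before submitting. The sketch below is a minimal illustration, not the submission's own tooling: the helper names (`artifact_bytes`, `check_artifacts`) and the example file names are hypothetical, and it assumes "under" means strictly fewer than 16,000,000 bytes per artifact.

```python
from pathlib import Path

# Cap taken from the card; treating it as a strict upper bound is an assumption.
BYTE_CAP = 16_000_000


def artifact_bytes(path):
    """Return the size in bytes of a file, or the total size of all files in a directory."""
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())


def check_artifacts(paths, cap=BYTE_CAP):
    """Return (sizes, over): every artifact's size, and the subset at or above the cap."""
    sizes = {str(p): artifact_bytes(p) for p in paths}
    over = {p: s for p, s in sizes.items() if s >= cap}
    return sizes, over


if __name__ == "__main__":
    # Hypothetical artifact names, one per seed; replace with real paths.
    candidates = ["seed0.bin", "seed1.bin", "seed2.bin"]
    existing = [p for p in candidates if Path(p).exists()]
    sizes, over = check_artifacts(existing)
    for name, size in sizes.items():
        status = "OVER CAP" if name in over else "ok"
        print(f"{name}: {size:,} bytes ({status})")
```

Running the check once per seed artifact catches the case where one seed's output is slightly larger than the others.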