PR #1954

open

Add SP8192 QK525 TTT006 under-16MB submission

by Syed-M-Zeeshan
val_bpb
1.0811
Architecture
Transformer
Optimizer
Artifact Size
under 16MB

Training Techniques

Sequence Length
sequence_length
train_length: 8192
eval_length: null
Test-Time Training
TTT
parameters: {"learning_rate":0.006,"epochs":3}
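The summary gives only the TTT learning rate and epoch count. As a rough illustration of what those two numbers control, here is a toy sketch that adapts a single scalar weight on an evaluation chunk with plain SGD. The model, the squared-error loss, the data, and the function names are all placeholders, not the PR's implementation.

```python
def adapt_at_test_time(weight, chunk, learning_rate=0.006, epochs=3):
    """Toy test-time training: run SGD on a scalar weight over an eval chunk.

    learning_rate and epochs mirror the values in the summary; everything
    else (scalar model, squared-error loss) is a stand-in for illustration.
    """
    for _ in range(epochs):
        for x in chunk:
            grad = 2.0 * (weight - x)  # derivative of (weight - x) ** 2
            weight -= learning_rate * grad
    return weight


def mean_squared_error(weight, chunk):
    """Average squared error of the scalar weight against the chunk."""
    return sum((weight - x) ** 2 for x in chunk) / len(chunk)


eval_chunk = [1.0, 1.2, 0.8, 1.1]  # hypothetical data
before = mean_squared_error(0.0, eval_chunk)
adapted = adapt_at_test_time(0.0, eval_chunk)
after = mean_squared_error(adapted, eval_chunk)
```

With a learning rate this small and only three passes, the weight moves a short distance toward the chunk's values, so `after` is lower than `before` but far from converged.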
Other
other
QK gain initialization set to 5.25
parameters: {"qk_gain_init":5.25}
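The summary does not say how the QK gain enters the attention computation. One plausible reading, assumed here, is a scalar that multiplies the dot product of unit-normalized query and key vectors, so that the initial value 5.25 bounds each attention logit to the range [-5.25, 5.25]. The sketch below shows that reading for a single query/key pair; the function name and the normalization step are assumptions.

```python
import math


def scaled_qk_logit(q, k, qk_gain=5.25):
    """Attention logit for one query/key pair under an assumed QK-gain scheme.

    Both vectors are L2-normalized, so their dot product is a cosine in
    [-1, 1]; the gain then sets the logit scale. In a real model the gain
    would presumably be a trainable parameter initialized to qk_gain.
    """
    q_norm = math.sqrt(sum(x * x for x in q))
    k_norm = math.sqrt(sum(x * x for x in k))
    cosine = sum(a * b for a, b in zip(q, k)) / (q_norm * k_norm)
    return qk_gain * cosine
```

Under this assumption a larger initial gain lets the softmax start out sharper, since the logits can spread over a wider range than plain cosine similarity allows.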

Novel Contributions

  • 3-seed under-16MB submission
  • Test-time training (TTT) at learning rate 0.006 for 3 epochs
  • QK gain initialization of 5.25
  • All artifacts kept under the 16,000,000-byte cap
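A size check along these lines could confirm the cap for each seed's artifact. This is a minimal sketch: the function name and file paths are hypothetical, and treating a file of exactly 16,000,000 bytes as over the cap is an assumption based on the word "under". Note that the cap is a decimal figure, smaller than 16 MiB (16,777,216 bytes).

```python
import os

ARTIFACT_CAP_BYTES = 16_000_000  # cap quoted in the summary (decimal, not MiB)


def oversized_artifacts(paths, cap_bytes=ARTIFACT_CAP_BYTES):
    """Return {path: size_in_bytes} for every file not strictly under the cap."""
    sizes = {path: os.path.getsize(path) for path in paths}
    return {path: size for path, size in sizes.items() if size >= cap_bytes}
```

An empty result means every listed file is under the cap, for example `oversized_artifacts(["seed0.bin", "seed1.bin", "seed2.bin"])` with placeholder file names.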