PR #2069

open

Non Record: 4xH100 - val_bpb: 1.26066159, QK5.25 TTT-disabled non-record submission

by tenet-diver
val_bpb: 1.2607
Architecture: Transformer
Optimizer: (none listed)
Artifact Size: 15080366 bytes
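
For readers unfamiliar with the two headline numbers, the following is a minimal sketch of how they might be interpreted. It assumes that val_bpb means validation bits per byte, obtained from a mean per-token cross-entropy in nats, and that the "16MB" limit is either 16,000,000 or 16 * 1024 * 1024 bytes; the PR does not state which. The function name and the example inputs are hypothetical and are not taken from the submission.

```python
import math


def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert a mean per-token cross-entropy (in nats) into bits per byte.

    Assumption: the loss is averaged over num_tokens tokens that together
    encode num_bytes bytes of raw validation text.
    """
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes


# Hypothetical inputs, chosen only to illustrate the conversion.
example_bpb = bits_per_byte(mean_loss_nats=2.0, num_tokens=1000, num_bytes=2400)

ARTIFACT_BYTES = 15_080_366            # artifact size reported in this PR
CAP_DECIMAL = 16_000_000               # assumption: "16MB" as decimal megabytes
CAP_BINARY = 16 * 1024 * 1024          # assumption: "16MB" as binary mebibytes

headroom_decimal = CAP_DECIMAL - ARTIFACT_BYTES
headroom_binary = CAP_BINARY - ARTIFACT_BYTES

print(f"example bpb: {example_bpb:.4f}")
print(f"headroom (decimal cap): {headroom_decimal} bytes")
print(f"headroom (binary cap):  {headroom_binary} bytes")
```

Under either reading of the cap, the reported artifact fits with room to spare.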

Training Techniques

Architecture
QK-gain
Uses an autoregressive control stack with QK-gain initialized to 5.25
parameters: {"qk_gain_init":5.25}
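
The PR does not describe how the gain enters the attention computation. As a rough illustration only, the sketch below assumes that queries and keys are RMS-normalized and that a scalar gain (learnable in a real model, here fixed at the qk_gain_init value) multiplies the normalized queries before the scaled dot product. The function names and the toy inputs are hypothetical, and the actual training code may differ.

```python
import math


def rms_normalize(vec):
    """Scale a vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec))
    return [x / (rms + 1e-8) for x in vec]


def causal_attention_weights(queries, keys, qk_gain=5.25):
    """Causal softmax attention weights with a gain on normalized queries.

    Assumption: qk_gain stands in for a parameter initialized from
    qk_gain_init; it is a plain float here for simplicity.
    """
    dim = len(queries[0])
    normed_keys = [rms_normalize(k) for k in keys]
    weights = []
    for i, query in enumerate(queries):
        scaled_q = [qk_gain * x for x in rms_normalize(query)]
        # Position i may attend only to positions 0..i (autoregressive mask).
        logits = [
            sum(a * b for a, b in zip(scaled_q, k)) / math.sqrt(dim)
            for k in normed_keys[: i + 1]
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy inputs: three positions, two dimensions.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sharp = causal_attention_weights(toy, toy, qk_gain=5.25)
soft = causal_attention_weights(toy, toy, qk_gain=1.0)
print("gain 5.25, last row:", [round(w, 3) for w in sharp[-1]])
print("gain 1.00, last row:", [round(w, 3) for w in soft[-1]])
```

With normalized queries and keys the raw logits are bounded, so a larger gain is what lets the softmax become sharply peaked; under this reading, the 5.25 value sets how sharp attention can be at initialization.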
Test-Time Training
TTT disabled
parameters: null
Sequence Length
sequence_length
train_length: 2097152
eval_length: null

Novel Contributions

  • Primary non-record 16MB submission from the deadline search
  • 4xH100 promotion of the best 1xH100 TTT-disabled QK5.25 control
  • Comparison against legal TTT, parallel-residual + legal TTT, and dense optimizer baseline
  • Demonstrates that disabling TTT slightly outperformed legal TTT for this SP1024 setup