PR #2069
open
Non Record: 4xH100 - val_bpb: 1.26066159, QK5.25 TTT-disabled non-record submission
by tenet-diver
val_bpb
1.2607
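The value shown here is the title's 1.26066159 rounded to four decimals. Assuming val_bpb means validation bits per byte (the summary does not define it), a minimal sketch of the usual conversion from mean per-token cross-entropy is given below. The function name and the example token and byte counts are hypothetical and are not taken from this run.

```python
import math


def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte.

    Assumption: the loss is averaged over n_tokens tokens that together
    encode n_bytes bytes of raw validation text.
    """
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# Hypothetical inputs, for illustration only.
example_bpb = bits_per_byte(mean_loss_nats=2.0, n_tokens=1000, n_bytes=2400)

# The summary value is the title value rounded to four decimals.
reported = round(1.26066159, 4)
print(example_bpb, reported)
```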
Architecture
Transformer
Optimizer
not specified
Artifact Size
15080366 bytes
Training Techniques
Architecture
QK-gain
Uses an autoregressive control stack with QK-gain initialized to 5.25
parameters: {"qk_gain_init":5.25}
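The summary does not say how the gain enters the attention computation. The sketch below assumes one common reading: queries and keys are L2-normalized, and a scalar gain (learnable in training, initialized from `qk_gain_init`) multiplies the resulting cosine-similarity logits before the softmax. All function names are hypothetical, and this is an illustration of that assumption rather than the submission's code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, qk_gain):
    """Attention weights for one query, with a gain on normalized QK logits.

    Assumption: qk_gain is a scalar multiplying cosine similarities.
    """
    qn = l2_normalize(q)
    logits = [
        qk_gain * sum(a * b for a, b in zip(qn, l2_normalize(k)))
        for k in keys
    ]
    return softmax(logits)


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
w_unit = attention_weights(q, keys, qk_gain=1.0)
w_gain = attention_weights(q, keys, qk_gain=5.25)  # value from qk_gain_init
print(w_unit, w_gain)
```

Under this assumption a larger gain sharpens the distribution: the key aligned with the query receives more weight at gain 5.25 than at gain 1.0.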
Test-Time Training
TTT disabled
parameters: null
Sequence Length
sequence_length
train_length: 2097152
eval_length: null
Novel Contributions
- Primary non-record 16MB submission from the deadline search
- 4xH100 promotion of the best 1xH100 TTT-disabled QK5.25 control
- Comparison against legal TTT, parallel-residual + legal TTT, and dense optimizer baseline
- Demonstrates that disabling TTT slightly outperformed legal TTT for this SP1024 setup
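The "16MB" in the first bullet can be checked against the reported artifact size. The summary does not say whether the cap is decimal or binary, so the sketch below computes the headroom under both readings; both limits are assumptions, and only the artifact size comes from the summary.

```python
ARTIFACT_BYTES = 15_080_366  # "Artifact Size" reported in the summary

# The summary only says "16MB"; both interpretations are assumptions.
LIMITS = {
    "decimal (16 * 10**6)": 16 * 10**6,
    "binary (16 * 2**20)": 16 * 2**20,
}


def headroom(artifact_bytes: int, limit_bytes: int) -> int:
    """Bytes left under the cap (negative means over budget)."""
    return limit_bytes - artifact_bytes


results = {name: headroom(ARTIFACT_BYTES, limit) for name, limit in LIMITS.items()}
for name, spare in results.items():
    print(f"{name}: {spare} bytes spare")
```

Under either reading the artifact fits, with 919,634 bytes to spare against a decimal cap and 1,696,850 against a binary one.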