PR #1625

open

[Non-record] E2E TTT at 27M scale — negative result (val_bpb 1.1104, SP1024)

by ChideraIbe123
val_bpb: 1.1104
Architecture: Transformer
Optimizer:
Artifact Size: 13.85 MB

Training Techniques

  • Test-Time Training: full TTT
    parameters: {"mode":"E2E","scope":"MLP-only in last fraction of blocks","last_frac":null,"learning_rate":0.015,"epochs":2}
  • Architecture: MLP
    TTT parameters filtered to MLPs in the last fraction of blocks for end-to-end test-time training.
    parameters: {"blocks":"L5-L10"}
  • Evaluation: sliding window eval
    parameters: null
  • Compression: lzma
    level: null
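The Compression entry lists lzma with an unspecified level. The PR does not say exactly how lzma is used, so purely as an illustration: a general-purpose compressor gives a rough bits-per-byte reference point against which a model's val_bpb (cross-entropy per byte) can be eyeballed. The helper name and sample text below are hypothetical, and `preset=None` mirrors the unspecified `level` field.

```python
import lzma

def lzma_bpb(data: bytes, preset=None) -> float:
    # Bits per byte achieved by lzma on `data` -- a naive baseline,
    # not the PR's val_bpb metric (which comes from model loss).
    compressed = lzma.compress(data, preset=preset)
    return 8 * len(compressed) / len(data)

# Highly repetitive text compresses far below 8 bits/byte.
text = b"the quick brown fox jumps over the lazy dog " * 100
baseline = lzma_bpb(text)
```

A model beating such a baseline on held-out bytes is the usual sanity check that it has learned more than surface redundancy.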

Novel Contributions

  • End-to-End Test-Time Training (E2E TTT) ported onto the merged SOTA stack
  • 3-config ablation of TTT hyperparameters at 27M scale
  • Negative result: roughly 0.001 BPB of total improvement despite large changes in learning rate, trainable-parameter scope, and epoch count
  • MLP-only TTT applied to the last fraction of blocks
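The E2E TTT recipe above (lr 0.015, 2 epochs, updates restricted to a subset of parameters) can be sketched in miniature. Everything below is a hypothetical pure-Python toy, not the PR's 27M-parameter transformer: a single weight `w` stands in for the trainable MLP parameters in the last blocks, and the frozen `b` stands in for everything else.

```python
def ttt_adapt(params, trainable, grad_fn, data, lr=0.015, epochs=2):
    # Test-time training sketch: take gradient steps on the test
    # sequence itself, updating only the parameters named in
    # `trainable` (the PR restricts updates to MLPs in the last
    # fraction of blocks); all other parameters stay frozen.
    for _ in range(epochs):
        for x, y in data:
            grads = grad_fn(params, x, y)
            for name in trainable:
                params[name] -= lr * grads[name]
    return params

# Toy stand-in model (hypothetical): y_hat = w*x + b, squared loss.
def grad_fn(p, x, y):
    err = p["w"] * x + p["b"] - y
    return {"w": 2 * err * x, "b": 2 * err}

params = ttt_adapt({"w": 0.0, "b": 0.0}, trainable={"w"},
                   grad_fn=grad_fn, data=[(1.0, 2.0), (2.0, 4.0)])
```

After adaptation, `w` has moved toward the test data's slope while the frozen `b` is untouched, mirroring the MLP-only scope of the full TTT run.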