PR #1625

open

[Non-record] E2E TTT at 27M scale — negative result (val_bpb 1.1104, SP1024)

by ChideraIbe123
val_bpb: 1.1104
Architecture: Transformer
Optimizer:
Artifact Size: 13.85 MB

Training Techniques

  • Test-Time Training: full TTT
    parameters: {"mode":"E2E","scope":"MLP-only in last fraction of blocks","last_frac":null,"learning_rate":0.015,"epochs":2}
  • Architecture: MLP
    TTT parameters filtered to MLPs in the last fraction of blocks for end-to-end test-time training.
    parameters: {"blocks":"L5-L10"}
  • Evaluation: sliding window eval
    parameters: null
  • Compression: lzma
    level: null
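The Compression entry lists lzma with an unspecified level. The PR does not say exactly how lzma is used, so purely as an illustration: a general-purpose compressor gives a rough bits-per-byte reference point against which a model's val_bpb (cross-entropy per byte) can be eyeballed. The helper name and sample text below are hypothetical, and `preset=None` mirrors the unspecified `level` field.

```python
import lzma

def lzma_bpb(data: bytes, preset=None) -> float:
    # Bits per byte achieved by lzma on `data` -- a naive baseline,
    # not the PR's val_bpb metric (which comes from model loss).
    compressed = lzma.compress(data, preset=preset)
    return 8 * len(compressed) / len(data)

# Highly repetitive text compresses far below 8 bits/byte.
text = b"the quick brown fox jumps over the lazy dog " * 100
baseline = lzma_bpb(text)
```

A model beating such a baseline on held-out bytes is the usual sanity check that it has learned more than surface redundancy.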

Novel Contributions

  • End-to-End Test-Time Training (E2E TTT) ported onto the merged SOTA stack
  • 3-config ablation of TTT hyperparameters at 27M scale
  • Negative result: roughly 0.001 BPB of total improvement despite large changes in learning rate, trainable-parameter scope, and epoch count
  • MLP-only TTT applied to the last fraction of blocks
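The E2E TTT recipe above (lr 0.015, 2 epochs, updates restricted to a subset of parameters) can be sketched in miniature. Everything below is a hypothetical pure-Python toy, not the PR's 27M-parameter transformer: a single weight `w` stands in for the trainable MLP parameters in the last blocks, and the frozen `b` stands in for everything else.

```python
def ttt_adapt(params, trainable, grad_fn, data, lr=0.015, epochs=2):
    # Test-time training sketch: take gradient steps on the test
    # sequence itself, updating only the parameters named in
    # `trainable` (the PR restricts updates to MLPs in the last
    # fraction of blocks); all other parameters stay frozen.
    for _ in range(epochs):
        for x, y in data:
            grads = grad_fn(params, x, y)
            for name in trainable:
                params[name] -= lr * grads[name]
    return params

# Toy stand-in model (hypothetical): y_hat = w*x + b, squared loss.
def grad_fn(p, x, y):
    err = p["w"] * x + p["b"] - y
    return {"w": 2 * err * x, "b": 2 * err}

params = ttt_adapt({"w": 0.0, "b": 0.0}, trainable={"w"},
                   grad_fn=grad_fn, data=[(1.0, 2.0), (2.0, 4.0)])
```

After adaptation, `w` has moved toward the test data's slope while the frozen `b` is untouched, mirroring the MLP-only scope of the full TTT run.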