PR #1320
open
Non-record: TTT chunk ordering does not improve BPB — negative results from 7 ordering variants
by jpfeiffe
val_bpb
1.1196
Architecture
Transformer
Optimizer
SGD
Artifact Size
—
Training Techniques
Architecture
LeakyReLU
Uses a LeakyReLU(0.5)^2 activation in the model: LeakyReLU with negative slope 0.5, raised to the power 2.
parameters: {"slope":0.5,"power":2}
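A minimal scalar sketch of this activation, assuming the notation means "apply LeakyReLU, then raise the result to the given power". The order of operations and the function name `leaky_relu_squared` are assumptions for illustration; the summary only gives the slope and power.

```python
def leaky_relu_squared(x: float, slope: float = 0.5, power: int = 2) -> float:
    """Apply LeakyReLU with the given negative slope, then raise to `power`.

    Assumption: LeakyReLU first, power second. With power=2 the output is
    always non-negative, so the sign of negative inputs is not preserved.
    """
    activated = x if x >= 0.0 else slope * x
    return activated ** power


if __name__ == "__main__":
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(value, leaky_relu_squared(value))
```

Under this reading, negative inputs are scaled by 0.5 before squaring, so they contribute a quarter of what the same-magnitude positive input would.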
BigramHash
Uses a BigramHash embedding component.
parameters: {"dimensions":1536}
Quantization
int6
bits: 6
scope: per-row
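As an illustration of what 6-bit per-row quantization can look like, the sketch below uses symmetric quantization with one scale per weight row. The symmetric range [-31, 31], the rounding rule, and the helper names are assumptions; the summary states only the bit width and the per-row scope.

```python
def quantize_rows_int6(weights):
    """Quantize each row to signed 6-bit integers using its own scale.

    Assumption: symmetric quantization, scale = max|w| / 31 per row.
    """
    qmax = 2 ** (6 - 1) - 1  # 31, largest magnitude kept for a signed 6-bit value
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer level and clamp into the int6 range.
        quantized.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Recover approximate float weights from integer levels and row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [10.0, 0.0, -5.0]]
    q, s = quantize_rows_int6(w)
    print(q, s)
    print(dequantize_rows(q, s))
```

Giving each row its own scale keeps a single large-magnitude row from coarsening the resolution of every other row.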
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"epochs_per_chunk":3,"learning_rate":0.002,"momentum":0.9,"all_gpu_per_chunk":true,"full_sequence_loss":true,"skip_final_chunk_training":true}
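The control flow implied by these parameters can be sketched with a toy one-parameter model: each chunk is scored with the current weights first, and only then used for training, with no training on the final chunk. The toy model, the squared-error loss, and the function name `score_first_ttt` are assumptions for illustration; `all_gpu_per_chunk`, `full_sequence_loss`, and the LR schedule are not modeled here.

```python
def score_first_ttt(chunks, epochs_per_chunk=3, learning_rate=0.002, momentum=0.9):
    """Score each chunk before adapting on it; never train on the final chunk.

    Toy stand-in for the real model: a single scalar `weight` that predicts
    every target in a chunk, trained with SGD plus momentum.
    """
    weight, velocity = 0.0, 0.0
    chunk_losses = []
    for index, chunk in enumerate(chunks):
        # 1) Score with parameters that have only seen earlier chunks.
        loss = sum((weight - y) ** 2 for y in chunk) / len(chunk)
        chunk_losses.append(loss)
        # 2) Skip adaptation on the last chunk: nothing is scored after it.
        if index == len(chunks) - 1:
            break
        # 3) Adapt on the chunk that was just scored.
        for _ in range(epochs_per_chunk):
            grad = sum(2.0 * (weight - y) for y in chunk) / len(chunk)
            velocity = momentum * velocity + grad
            weight -= learning_rate * velocity
    return chunk_losses, weight


if __name__ == "__main__":
    losses, final_weight = score_first_ttt([[1.0, 1.0]] * 3, learning_rate=0.1)
    print(losses, final_weight)
```

Because the loss for a chunk is recorded before any gradient step on that chunk, the reported score never benefits from training on the tokens being scored.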
LR Schedule
cosine decay
parameters: {"per_chunk":true}
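One possible reading of a per-chunk cosine schedule is that the learning rate restarts at its base value for every chunk and decays over that chunk's training steps. The sketch below follows that reading; the restart behaviour, the `min_lr` floor, and the function names are assumptions, since the summary gives only `per_chunk: true`.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 0.002, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate for `step` in [0, total_steps)."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def per_chunk_schedule(num_chunks: int, steps_per_chunk: int, base_lr: float = 0.002):
    """Assumed per-chunk behaviour: the schedule restarts for every chunk."""
    return [
        [cosine_lr(step, steps_per_chunk, base_lr) for step in range(steps_per_chunk)]
        for _ in range(num_chunks)
    ]


if __name__ == "__main__":
    for chunk_index, rates in enumerate(per_chunk_schedule(2, 3)):
        print(chunk_index, rates)
```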
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Evaluation
sliding window eval
parameters: {"full":true}
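Sliding window evaluation is commonly implemented by advancing a fixed-length context window by a stride and scoring only the tokens that earlier windows have not already scored. The sketch below generates such spans; the window and stride values in the usage example are hypothetical, and the meaning of `full: true` is not modeled.

```python
def sliding_windows(num_tokens: int, window: int, stride: int):
    """Return (start, end, score_from) spans so each token is scored once.

    The model would see tokens [start, end) as context and report loss only
    on tokens [score_from, end).
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    spans = []
    scored_up_to = 0
    start = 0
    while scored_up_to < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_up_to))
        scored_up_to = end
        start += stride
    return spans


if __name__ == "__main__":
    for span in sliding_windows(num_tokens=10, window=4, stride=2):
        print(span)
```

A smaller stride gives each scored token more preceding context at the cost of more forward passes.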
Other
other
Orders documents by embedding similarity and tests clustering variants for TTT chunk scheduling, including nearest-neighbor ordering, majority-overlap clustering, microcluster bin-packing, and contiguous shard ordering.
parameters: {"variants_tested":7}
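To make the nearest-neighbor variant concrete, here is a greedy sketch: start from one document and repeatedly visit the unvisited document whose embedding is most similar to the current one. The greedy strategy, cosine similarity as the metric, the starting index, and the function names are assumptions; the summary does not specify how the PR builds the ordering.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def nearest_neighbor_order(embeddings, start=0):
    """Greedy ordering: always move to the most similar unvisited document."""
    if not embeddings:
        return []
    remaining = set(range(len(embeddings)))
    remaining.discard(start)
    order = [start]
    while remaining:
        current = embeddings[order[-1]]
        # sorted() makes ties resolve to the lowest document index.
        best = max(sorted(remaining), key=lambda i: cosine_similarity(current, embeddings[i]))
        order.append(best)
        remaining.remove(best)
    return order


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    print(nearest_neighbor_order(docs))
```

The intent of such an ordering is that consecutive chunks share content, so adaptation on one chunk might transfer to the next; the PR reports that this gave no meaningful improvement over sequential order.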
Novel Contributions
- Systematic negative-result study of whether chunk ordering improves score-first TTT BPB
- Evaluation of seven ordering and clustering variants for TTT chunk scheduling
- Finding that global nearest-neighbor ordering provides no meaningful improvement over sequential order
- Finding that clustering and sharded execution variants are worse than the sequential baseline
- Analysis showing alignment gating does not beat always-updating TTT