PR #1320

open

Non-record: TTT chunk ordering does not improve BPB — negative results from 7 ordering variants

by jpfeiffe
val_bpb
1.1196
Architecture
Transformer
Optimizer
SGD
Artifact Size

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU(0.5)^2 activation in the model.
parameters: {"slope":0.5,"power":2}
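
One plausible reading of `LeakyReLU(0.5)^2` is the LeakyReLU output raised to the stated power. The scalar sketch below illustrates that reading only; the function name is hypothetical, and whether the PR preserves the sign of negative inputs is not stated here.

```python
def leaky_relu_squared(x: float, slope: float = 0.5, power: int = 2) -> float:
    """Apply LeakyReLU with the given negative slope, then raise the result to `power`.

    Assumption: the square is taken on the activation output, so negative
    inputs produce positive outputs when `power` is even.
    """
    activated = x if x >= 0.0 else slope * x
    return activated ** power
```

Under this reading, negative inputs are scaled by 0.5 before squaring, so they contribute a quarter of what the same positive magnitude would.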
BigramHash
Uses a BigramHash embedding component.
parameters: {"dimensions":1536}
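
A bigram hash embedding typically maps each (previous token, current token) pair to a row of a small learned table. The sketch below shows only the indexing step and assumes that 1536 is the number of hash buckets; the listing calls it "dimensions", so this interpretation, the function names, the multiplier constant, and the `bos_token` default are all illustrative assumptions rather than the PR's implementation.

```python
from typing import List, Sequence


def bigram_bucket(prev_token: int, token: int, num_buckets: int = 1536) -> int:
    """Map a (previous, current) token pair to a bucket index in [0, num_buckets)."""
    # The multiplier is an arbitrary large odd constant chosen for illustration.
    return (prev_token * 1_000_003 + token) % num_buckets


def bigram_buckets(
    tokens: Sequence[int], num_buckets: int = 1536, bos_token: int = 0
) -> List[int]:
    """Return one bucket index per position, pairing each token with its predecessor."""
    indices = []
    prev = bos_token  # hypothetical placeholder predecessor for the first position
    for token in tokens:
        indices.append(bigram_bucket(prev, token, num_buckets))
        prev = token
    return indices
```

Each index would then select a learned vector that is combined with the ordinary token embedding; distinct bigrams may collide in the same bucket by design.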
Quantization
int6
bits: 6
scope: per-row
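
Per-row quantization stores one scale per weight-matrix row alongside low-bit integer codes. The sketch below assumes a symmetric signed 6-bit range of [-31, 31] with round-to-nearest; the PR's exact range, rounding, and clipping choices are not given here, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_row_int6(row: List[float]) -> Tuple[List[int], float]:
    """Quantize one row to signed 6-bit codes with a single per-row scale."""
    qmax = 31  # assumed symmetric range [-31, 31]
    max_abs = max((abs(v) for v in row), default=0.0)
    # An all-zero row gets scale 1.0 so the division below stays defined.
    scale = max_abs / qmax if max_abs > 0.0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from codes and the row scale."""
    return [c * scale for c in codes]
```

With round-to-nearest, the reconstruction error of each element is at most half the row's scale.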
Compression
zstd
level: 22
Test-Time Training
score-first TTT
parameters: {"epochs_per_chunk":3,"learning_rate":0.002,"momentum":0.9,"all_gpu_per_chunk":true,"full_sequence_loss":true,"skip_final_chunk_training":true}
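
The control flow implied by these parameters can be sketched as follows: each chunk is scored before the model trains on it, training runs for `epochs_per_chunk` passes, and the final chunk is scored but not trained on. The `score` and `train_epoch` callbacks are hypothetical stand-ins for the real model code, which is not shown in this listing.

```python
from typing import Callable, List, Sequence, TypeVar

Chunk = TypeVar("Chunk")


def score_first_ttt(
    chunks: Sequence[Chunk],
    score: Callable[[Chunk], float],
    train_epoch: Callable[[Chunk], None],
    epochs_per_chunk: int = 3,
    skip_final_chunk_training: bool = True,
) -> List[float]:
    """Score each chunk first, then adapt on it, so no chunk is scored after being trained on."""
    losses = []
    last_index = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        # Scoring happens before any update that has seen this chunk.
        losses.append(score(chunk))
        # Training on the last chunk cannot help any later score, so it may be skipped.
        if skip_final_chunk_training and index == last_index:
            continue
        for _ in range(epochs_per_chunk):
            train_epoch(chunk)
    return losses
```

Because the loop visits chunks in the order given, any chunk ordering scheme only needs to permute `chunks` before calling it.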
LR Schedule
cosine decay
parameters: {"per_chunk":true}
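
One reading of a per-chunk cosine schedule is that the learning rate is fixed within a chunk and decays along a half cosine across chunks. The sketch below assumes that reading and a decay to zero at the last chunk; neither detail is confirmed by this listing (the schedule could instead restart inside every chunk), and `chunk_lr` is a hypothetical name.

```python
import math


def chunk_lr(chunk_index: int, num_chunks: int, base_lr: float = 0.002) -> float:
    """Cosine-decayed learning rate for a chunk, from base_lr at chunk 0 to 0 at the last chunk."""
    if num_chunks <= 1:
        return base_lr
    progress = chunk_index / (num_chunks - 1)  # 0.0 at the first chunk, 1.0 at the last
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```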
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
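
For reference, a plain SGD-with-momentum step using the listed values (learning rate 0.002, momentum 0.9, no weight decay) can be sketched as below. It uses the common convention in which the velocity accumulates raw gradients and the learning rate is applied at the parameter update; whether the PR uses dampening or Nesterov momentum is not stated, so both are assumed absent.

```python
from typing import List, Tuple


def sgd_momentum_step(
    params: List[float],
    grads: List[float],
    velocity: List[float],
    lr: float = 0.002,
    momentum: float = 0.9,
) -> Tuple[List[float], List[float]]:
    """Return updated (params, velocity) after one SGD step with momentum."""
    # velocity <- momentum * velocity + grad
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    # param <- param - lr * velocity
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```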
Evaluation
sliding window eval
parameters: {"full":true}
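
Sliding window evaluation usually means scoring each token once while giving it as much preceding context as the window allows. The sketch below generates the window boundaries for that scheme; the `window` and `stride` arguments are hypothetical, since this listing gives no values, and the meaning of `"full": true` is not specified here.

```python
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (context_start, score_start, end) triples.

    The model would read tokens [context_start, end) and only the losses for
    positions [score_start, end) are counted, so every token is scored once.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    spans = []
    for score_start in range(0, n_tokens, stride):
        end = min(score_start + stride, n_tokens)
        context_start = max(0, end - window)  # extend backwards for extra context
        spans.append((context_start, score_start, end))
    return spans
```

A smaller stride gives each scored token more context at the cost of more forward passes.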
Other
other
Document ordering by embedding similarity and clustering variants for chunk scheduling, including nearest-neighbor ordering, majority-overlap clustering, microcluster bin-packing, and contiguous shard ordering.
parameters: {"variants_tested":7}
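
As an illustration of the nearest-neighbor ordering idea, the sketch below builds a greedy chain: starting from one document, it repeatedly moves to the most cosine-similar document not yet visited. This is one simple way to realize such an ordering, not necessarily the PR's algorithm; the function names and the choice of starting document are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def nearest_neighbor_order(embeddings: Sequence[Sequence[float]], start: int = 0) -> List[int]:
    """Greedy chain: always step to the most similar unvisited document."""
    if not embeddings:
        return []
    order = [start]
    remaining = set(range(len(embeddings))) - {start}
    while remaining:
        current = embeddings[order[-1]]
        # Iterating in sorted order makes ties resolve to the lowest index.
        nxt = max(sorted(remaining), key=lambda i: cosine(current, embeddings[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

This greedy version costs quadratic time in the number of documents, which is acceptable for a sketch but may need an approximate neighbor index at scale.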

Novel Contributions

  • Systematic negative-result study of whether chunk ordering improves score-first TTT BPB
  • Evaluation of seven ordering and clustering variants for TTT chunk scheduling
  • Finding that global nearest-neighbor ordering provides no meaningful improvement over sequential order
  • Finding that clustering and sharded execution variants are worse than the sequential baseline
  • Analysis showing alignment gating does not beat always-updating TTT