PR #1320

open

Non-record: TTT chunk ordering does not improve BPB — negative results from 7 ordering variants

by jpfeiffe

val_bpb

1.1196

Architecture

Transformer

Optimizer

SGD

Artifact Size

—

Training Techniques

Architecture

LeakyReLU

Uses LeakyReLU(0.5)^2 activation in the model.

parameters: {"slope":0.5,"power":2}
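
A minimal scalar sketch of the activation as the parameters describe it: a leaky ReLU with negative slope 0.5, raised to the power 2. The function name `leaky_relu_squared` is hypothetical and this is not taken from the PR's code; a real model would apply the same formula element-wise to tensors.

```python
def leaky_relu_squared(x: float, slope: float = 0.5, power: int = 2) -> float:
    """Apply leaky ReLU with the given negative slope, then raise to `power`."""
    # Leaky ReLU: identity for non-negative inputs, scaled by `slope` otherwise.
    y = x if x >= 0.0 else slope * x
    # With power=2 the output is always non-negative, so the sign of x is lost.
    return y ** power
```

For example, `leaky_relu_squared(-2.0)` first maps -2.0 to -1.0 and then squares it, giving 1.0.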

BigramHash

Uses a BigramHash embedding component.

parameters: {"dimensions":1536}

Quantization

int6

bits: 6

scope: per-row
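
A rough sketch of what per-row 6-bit quantization can look like, assuming a symmetric scheme with one floating-point scale per weight row and integer codes in [-31, 31]. The PR does not state the exact scheme, so the clipping range, rounding rule, and helper names here are assumptions.

```python
from typing import List, Tuple

QMAX = 31  # assumed symmetric 6-bit range: integer codes in [-31, 31]

def quantize_row_int6(row: List[float]) -> Tuple[List[int], float]:
    """Quantize one row to 6-bit integer codes using a single per-row scale."""
    max_abs = max((abs(v) for v in row), default=0.0)
    # An all-zero (or empty) row gets scale 1.0 to avoid division by zero.
    scale = max_abs / QMAX if max_abs > 0.0 else 1.0
    codes = [max(-QMAX, min(QMAX, round(v / scale))) for v in row]
    return codes, scale

def dequantize_row(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def quantize_per_row(matrix: List[List[float]]) -> List[Tuple[List[int], float]]:
    """Quantize each row independently so every row keeps its own scale."""
    return [quantize_row_int6(row) for row in matrix]
```

Keeping a separate scale per row means a row with small weights is not forced onto the coarse grid of a row with large weights.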

Compression

zstd

level: 22

Test-Time Training

score-first TTT

parameters: {"epochs_per_chunk":3,"learning_rate":0.002,"momentum":0.9,"all_gpu_per_chunk":true,"full_sequence_loss":true,"skip_final_chunk_training":true}
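
The control flow implied by these parameters can be sketched as follows, assuming "score-first" means each chunk is scored with the current weights before the model takes any training step on it. `score_fn` and `train_fn` are hypothetical callbacks standing in for the real evaluation and SGD code, which this page does not show.

```python
from typing import Callable, List, Sequence, TypeVar

Chunk = TypeVar("Chunk")

def score_first_ttt(
    chunks: Sequence[Chunk],
    score_fn: Callable[[Chunk], float],
    train_fn: Callable[[Chunk, int], None],
    epochs_per_chunk: int = 3,
    skip_final_chunk_training: bool = True,
) -> List[float]:
    """Score every chunk before adapting on it; return the per-chunk scores."""
    losses: List[float] = []
    last_index = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        # Score first: the reported loss never uses weights trained on this chunk.
        losses.append(score_fn(chunk))
        # Nothing is scored after the final chunk, so training on it is wasted work.
        if skip_final_chunk_training and index == last_index:
            continue
        for epoch in range(epochs_per_chunk):
            train_fn(chunk, epoch)
    return losses
```

Under this reading, the order in which chunks are visited only changes which earlier chunks the model has adapted to when a given chunk is scored, which is the effect the ordering variants in this PR try to exploit.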

LR Schedule

cosine decay

parameters: {"per_chunk":true}
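
One plausible reading of `per_chunk` is that the learning rate is held constant within a chunk and follows a cosine curve over the chunk index. The sketch below assumes that reading and a decay to zero at the last chunk; the PR may instead restart the schedule inside each chunk. `chunk_lr` is a hypothetical helper, and the default base rate is the 0.002 listed above.

```python
import math

def chunk_lr(chunk_index: int, num_chunks: int, base_lr: float = 0.002) -> float:
    """Cosine-decayed learning rate for a given chunk (assumed schedule)."""
    # Guard against a single-chunk run, where the denominator would be zero.
    denom = max(num_chunks - 1, 1)
    progress = chunk_index / denom  # 0.0 at the first chunk, 1.0 at the last
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 11 chunks, this gives the full base rate at chunk 0, about half of it at chunk 5, and zero at chunk 10.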

Optimizer

SGD

weight_decay: null

momentum: 0.9

other_params: {"learning_rate":0.002}

Evaluation

sliding window eval

parameters: {"full":true}
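
As a rough illustration of sliding window evaluation in general, the sketch below produces overlapping windows in which every token is scored exactly once, with later windows reusing earlier tokens purely as context. The window and stride values, the meaning of `full`, and the helper name `sliding_window_spans` are not given on this page and are assumptions.

```python
from typing import List, Tuple

def sliding_window_spans(num_tokens: int, window: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (begin, end, score_from) triples covering all tokens.

    Each window feeds tokens [begin, end) to the model, but only tokens in
    [score_from, end) contribute to the loss; the rest are context.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    spans: List[Tuple[int, int, int]] = []
    begin, scored = 0, 0
    while scored < num_tokens:
        end = min(begin + window, num_tokens)
        spans.append((begin, end, scored))
        scored = end      # everything up to `end` has now been scored once
        begin += stride   # slide forward, keeping window - stride context tokens
    return spans
```

A smaller stride gives each scored token more left context at the cost of more forward passes.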

Other

other

Document ordering by embedding similarity and clustering variants for chunk scheduling, including nearest-neighbor ordering, majority-overlap clustering, microcluster bin-packing, and contiguous shard ordering.

parameters: {"variants_tested":7}
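
To make the nearest-neighbor variant concrete, here is a minimal greedy sketch: start from one chunk and repeatedly visit the unvisited chunk whose embedding is most similar to the current one. Cosine similarity, the greedy strategy, and the function names are assumptions for illustration; the PR's actual embedding source and tie-breaking rules are not described here.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0.0 and norm_b > 0.0 else 0.0

def greedy_nn_order(embeddings: Sequence[Sequence[float]], start: int = 0) -> List[int]:
    """Order chunk indices by hopping to the most similar unvisited chunk."""
    if not embeddings:
        return []
    remaining = set(range(len(embeddings))) - {start}
    order = [start]
    while remaining:
        current = embeddings[order[-1]]
        # Iterate in sorted order so ties resolve to the lowest index.
        nxt = max(sorted(remaining), key=lambda j: cosine_similarity(current, embeddings[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

The resulting index list would replace the sequential chunk order in the TTT loop; per the findings below, this kind of reordering gave no meaningful BPB gain over sequential order.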

Novel Contributions

Systematic negative-result study of whether chunk ordering improves score-first TTT BPB
Evaluation of seven ordering and clustering variants for TTT chunk scheduling
Finding that global nearest-neighbor ordering provides no meaningful improvement over sequential order
Finding that clustering and sharded execution variants are worse than the sequential baseline
Analysis showing alignment gating does not beat always-updating TTT