PR #1186

open

Non-record: Negative Results — Architecture, TTT Variants, Quantization, and N-gram Cache Illegality

by andrewbaggio1
val_bpb: 0.9850
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.62 MB

Training Techniques

Test-Time Training
full TTT
parameters: {"epochs":20,"learning_rate":null}
LR Schedule
cosine decay
parameters: null
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
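As an illustration of the sliding window eval entry above, here is a minimal sketch of how windows could be laid out so that each token is scored exactly once with as much left context as the window allows. Only stride=64 and seq_len=2048 come from this page; the function name `sliding_windows` and the rule that each later window scores only its newest tokens are assumptions, not the PR's actual evaluation code.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Return (ctx_start, ctx_end, score_start) triples.

    Tokens in [score_start, ctx_end) are scored using the context that
    begins at ctx_start. Hypothetical helper, not taken from the PR.
    """
    if not 0 < stride <= seq_len:
        raise ValueError("stride must be in (0, seq_len]")
    windows = []
    scored_upto = 0  # first token index that has not been scored yet
    start = 0
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


# Small example: 10 tokens, windows of 4, stride of 2.
example = sliding_windows(10, seq_len=4, stride=2)
```

In this layout the first window scores all of its tokens, and every later window scores only the last `stride` tokens, so a smaller stride gives each scored token more context at the cost of more forward passes.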
Architecture
LeakyReLU
Base architecture uses LeakyReLU(0.5)^2 stack
parameters: {"slope":0.5}
BigramHash
Hashed n-gram cache / count-sketch interpolation
parameters: {"order":2}
TrigramHash
Evaluated trigram hash variant
parameters: {"order":3}
depth recurrence
Huginn-style depth recurrence
parameters: null
MLP3x
MLP width multiplier variant
parameters: {"multiplier":3.25}
XSA
XSA-all variant
parameters: {"layers":11}
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: null
Quantization
int6
bits: 6
scope: all
int5
bits: 5
scope: all
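For readers unfamiliar with sub-byte quantization, the sketch below shows one common way to map weights to 6-bit or 5-bit integers. The page only records the bit widths and "scope: all"; the symmetric, single-scale scheme and the names `quantize_symmetric` and `dequantize` are assumptions chosen for illustration, and the PR may use a different scheme.

```python
def quantize_symmetric(weights, bits=6):
    """Quantize a list of floats to signed integers with one shared scale.

    Assumed scheme: symmetric range [-qmax, qmax], round to nearest.
    """
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 15 for int5
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [1.0, -1.0, 0.0, 0.37, -0.82]
q6, s6 = quantize_symmetric(weights, bits=6)
q5, s5 = quantize_symmetric(weights, bits=5)
```

Under this scheme the worst-case rounding error per weight is half a quantization step (`scale / 2`), so dropping from 6 to 5 bits roughly doubles it.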
Other
other
Entropy-adaptive n-gram cache interpolation with multi-order backoff (2,3,4,5-gram) using score-first cache updates
parameters: {"orders":[2,3,4,5]}
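The last entry above combines three ideas: backoff across n-gram orders, an entropy-dependent mixing weight, and score-first cache updates. The sketch below is one way these could fit together, using exact (unhashed) counts for clarity. The class `NGramCache`, the function `score_stream`, the cap `lam_max`, and the specific rule "weight proportional to normalized model entropy" are all assumptions for illustration; the page does not give the PR's actual formula.

```python
import math
from collections import Counter, defaultdict


class NGramCache:
    """Exact-count n-gram cache with longest-context-first backoff."""

    def __init__(self, orders=(2, 3, 4, 5)):
        self.orders = sorted(orders, reverse=True)
        self.counts = {n: defaultdict(Counter) for n in self.orders}

    def _context(self, history, n):
        # An order-n model conditions on the previous n - 1 tokens.
        return tuple(history[len(history) - (n - 1):])

    def predict(self, history):
        """Return (next-token counts, total) for the longest seen context."""
        for n in self.orders:
            if len(history) >= n - 1:
                seen = self.counts[n].get(self._context(history, n))
                if seen:
                    return seen, sum(seen.values())
        return None

    def update(self, history, token):
        for n in self.orders:
            if len(history) >= n - 1:
                self.counts[n][self._context(history, n)][token] += 1


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def score_stream(tokens, model_probs, cache, lam_max=0.5):
    """Total negative log-likelihood in nats.

    model_probs[i] is the model's distribution for tokens[i].
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        p = model_probs[i]
        history = tokens[:i]
        p_tok = p[tok]
        hit = cache.predict(history)
        if hit is not None:
            seen, n = hit
            # Assumed rule: trust the cache more when the model is uncertain.
            lam = lam_max * entropy(p) / math.log(len(p))
            p_tok = (1 - lam) * p[tok] + lam * seen[tok] / n
        total -= math.log(p_tok)
        # Score first, then update: a token never informs its own prediction.
        cache.update(history, tok)
    return total


tokens = [0, 1, 0, 1, 0, 1]
uniform = [[0.5, 0.5] for _ in tokens]
nll_mixed = score_stream(tokens, uniform, NGramCache())
nll_model_only = -len(tokens) * math.log(0.5)
```

On this toy alternating stream the cache starts helping from the fourth token onward, so `nll_mixed` falls below the model-only baseline. Updating only after scoring is what keeps the mixture a legitimate predictor, since each token is scored from counts that exclude it.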

Novel Contributions

  • Combined cosine TTT with multi-order n-gram cache at inference time
  • Used entropy-adaptive mixing between model probabilities and n-gram probabilities
  • Documented negative results for architecture changes, TTT variants, and quantization
  • Showed hashed n-gram caches can be fundamentally broken due to collision-driven probability inflation
  • Compared several architecture variants including depth recurrence, TrigramHash, MLP width scaling, and XSA-all
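The collision-driven inflation named in the list above can be reproduced on a toy example. The sketch below stores bigram counts in a small hashed table and compares the resulting "probabilities" with exact counts. The hash function, table size, vocabulary size, and the choice to keep the context denominator exact are all assumptions made to isolate the effect; they are not the PR's configuration.

```python
from collections import Counter

VOCAB = 16
BUCKETS = 64  # far fewer buckets than the 16 * 16 possible bigrams


def bucket(ctx, tok):
    # Toy hash, assumed for illustration only.
    return (ctx * 31 + tok) % BUCKETS


# Deterministic stream in which every bigram (a, b) occurs at least once.
stream = [t for a in range(VOCAB) for b in range(VOCAB) for t in (a, b)]

hashed = [0] * BUCKETS
exact_pairs = Counter()
ctx_counts = Counter()
for prev, tok in zip(stream, stream[1:]):
    hashed[bucket(prev, tok)] += 1  # collisions pile onto shared buckets
    exact_pairs[(prev, tok)] += 1
    ctx_counts[prev] += 1


def mass(ctx, use_hash):
    """Sum of estimated P(tok | ctx) over the whole vocabulary."""
    total = 0.0
    for tok in range(VOCAB):
        c = hashed[bucket(ctx, tok)] if use_hash else exact_pairs[(ctx, tok)]
        total += c / ctx_counts[ctx]
    return total


exact_mass = mass(3, use_hash=False)
hashed_mass = mass(3, use_hash=True)
```

With exact counts the estimates sum to 1, as a probability distribution must. With the hashed table, every bucket also carries counts from unrelated bigrams, so each per-token estimate can only grow and the total exceeds 1. Scoring a token with such an unnormalized estimate overstates its probability, which is the failure mode the PR describes.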