PR #1272

open

Non-record: Comprehensive Negative Results — What Doesn't Work on Strong Models

by andrewbaggio1
val_bpb
1.1100
Architecture
Transformer
Optimizer
Artifact Size

Training Techniques

Quantization
GPTQ
bits: null
scope: all
Evaluation
sliding window eval
parameters: null
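The sliding-window evaluation listed above reports no parameters; as a hedged sketch, here is the standard overlapping-window scheme, where the `score_fn` interface and the window/stride values are illustrative assumptions, not details from this entry:

```python
def sliding_window_nll(score_fn, tokens, window, stride):
    """Average per-token NLL under sliding-window evaluation.

    score_fn(chunk) is assumed to return one NLL per position in chunk.
    Windows overlap by (window - stride) tokens; each position is scored
    exactly once, in the first window that reaches it, so tokens past the
    first window always see (window - stride) tokens of left context.
    """
    total, count, scored_upto = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        losses = score_fn(tokens[begin:end])
        # Skip positions already scored by an earlier window.
        for pos in range(max(begin, scored_upto), end):
            total += losses[pos - begin]
            count += 1
        scored_upto = end
        if end == len(tokens):
            break
    return total / count
```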
n-gram cache
parameters: {"smoothing":"Kneser-Ney","order":7,"normalization":"exact trie"}
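The entry's n-gram cache uses order-7 Kneser-Ney over an exact trie; as a hedged illustration only, here is interpolated Kneser-Ney at order 2 (bigrams), with a fixed absolute discount as an assumed parameter:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram probabilities (order-2 sketch;
    the leaderboard entry uses order 7 over an exact trie)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter(tokens[:-1])
    followers = defaultdict(set)   # distinct words following each context
    preceders = defaultdict(set)   # distinct contexts preceding each word
    for v, w in bigrams:
        followers[v].add(w)
        preceders[w].add(v)
    total_types = len(bigrams)     # number of distinct bigram types

    def prob(v, w):
        # Continuation probability: fraction of bigram types ending in w.
        p_cont = len(preceders[w]) / total_types
        c_v = context_count[v]
        if c_v == 0:
            return p_cont          # back off fully for unseen contexts
        discounted = max(bigrams[(v, w)] - discount, 0.0) / c_v
        backoff_mass = discount * len(followers[v]) / c_v
        return discounted + backoff_mass * p_cont

    return prob
```

With the absolute-discount interpolation above, the probabilities over the observed vocabulary sum to one for any seen context, which is the "exact normalization" property the entry emphasizes.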
Other
other
Online logit bias via per-token SGD on a logit bias vector
parameters: null
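The online logit-bias technique above can be sketched as follows; the learning rate and the plain cross-entropy objective are assumptions, since the entry reports no parameters:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class OnlineLogitBias:
    """Per-token SGD on an additive logit-bias vector (illustrative
    sketch; the update rule used in the PR is not specified)."""
    def __init__(self, vocab_size, lr=0.1):
        self.bias = [0.0] * vocab_size
        self.lr = lr

    def adjust(self, logits):
        # The bias is added to the model's raw logits before evaluation.
        return [l + b for l, b in zip(logits, self.bias)]

    def update(self, logits, target):
        # d(cross-entropy)/d(bias) = softmax(logits + bias) - one_hot(target)
        probs = softmax(self.adjust(logits))
        for i, p in enumerate(probs):
            grad = p - (1.0 if i == target else 0.0)
            self.bias[i] -= self.lr * grad
```

Note that the extra softmax and SGD step per generated token is exactly the kind of overhead that can blow an evaluation compute budget, as the contributions list reports.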
other
Complementary training that down-weights n-gram-predictable tokens in the training loss
parameters: null
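One plausible form of the complementary-training scheme above is a per-token loss weight that shrinks as an n-gram model's probability of the token grows; the weighting function here is an assumption, not the PR's actual rule:

```python
def complementary_weights(ngram_probs, alpha=1.0):
    """Weight for each training token: low when an n-gram model already
    assigns it high probability, high otherwise, so gradient signal
    concentrates on tokens the cache cannot predict (hypothetical form)."""
    return [(1.0 - p) ** alpha for p in ngram_probs]

def weighted_mean_loss(token_nlls, weights):
    # Weighted average of per-token negative log-likelihoods.
    return sum(w * l for w, l in zip(weights, token_nlls)) / sum(weights)
```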
other
SLOT-based method with causal dependence violation concerns
parameters: null
other
Scylla tokenizer with corrected byte accounting
parameters: null
Architecture
MLP adapters
Zero-init rank-64 prime MLP adapters
parameters: {"rank":64,"init":"zero"}
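The zero-init rank-64 adapter above can be sketched as a residual bottleneck MLP whose up-projection starts at zero, so the adapted network is initially identical to the base model; the "prime" placement scheme in the entry is not reproduced here:

```python
import random

class ZeroInitAdapter:
    """Rank-r residual MLP adapter with a zero-initialized up-projection,
    making the adapter an exact identity at initialization (sketch)."""
    def __init__(self, d_model, rank=64, seed=0):
        rng = random.Random(seed)
        scale = d_model ** -0.5
        self.down = [[rng.gauss(0.0, scale) for _ in range(rank)]
                     for _ in range(d_model)]             # d_model x rank
        self.up = [[0.0] * d_model for _ in range(rank)]  # rank x d_model

    def __call__(self, x):
        # h = relu(x @ down); out = x + h @ up (residual connection)
        rank = len(self.up)
        h = [max(0.0, sum(x[i] * self.down[i][r] for i in range(len(x))))
             for r in range(rank)]
        return [x[j] + sum(h[r] * self.up[r][j] for r in range(rank))
                for j in range(len(x))]
```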
XSA
XSA applied on all layers
parameters: {"layers":"all"}
Test-Time Training
score-first TTT
parameters: null

Novel Contributions

  • Comprehensive negative-results report on techniques that do not improve strong GPTQ'd models
  • Proof that properly normalized n-gram caches provide only negligible gains on strong models
  • Demonstration that online logit bias both hurts quality and exceeds the evaluation compute budget
  • Evidence that prime MLP adapters, complementary training, and score-first chunked TTT are ineffective on the reported baseline
  • Claim that prior Scylla sub-1.0 BPB results were due to byte-accounting bugs
  • Identification of dominant factors as training data volume, Full Hessian GPTQ, coprime-stride data loading, and XSA
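The coprime-stride data loading named in the last bullet can be illustrated with a short sketch: stepping through N data chunks with a stride coprime to N yields a permutation that visits every chunk exactly once per cycle (the chunk granularity here is a hypothetical example):

```python
from math import gcd

def coprime_stride_order(n_chunks, stride):
    """Indices visited when stepping by `stride` mod n_chunks.
    gcd(stride, n_chunks) == 1 guarantees a full permutation:
    every chunk is read exactly once before any repeats."""
    if gcd(n_chunks, stride) != 1:
        raise ValueError("stride must be coprime to n_chunks")
    return [(i * stride) % n_chunks for i in range(n_chunks)]
```

This gives a cheap, deterministic shuffle-like traversal without storing a permutation, which is presumably why it is listed among the dominant factors.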