PR #1272

open

Non-record: Comprehensive Negative Results — What Doesn't Work on Strong Models

by andrewbaggio1
val_bpb
1.1100
Architecture
Transformer
Optimizer
Artifact Size

Training Techniques

Quantization
GPTQ
bits: null
scope: all
Evaluation
sliding window eval
parameters: null
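The sliding-window evaluation listed above reports no parameters; as a hedged sketch, here is the standard overlapping-window scheme, where the `score_fn` interface and the window/stride values are illustrative assumptions, not details from this entry:

```python
def sliding_window_nll(score_fn, tokens, window, stride):
    """Average per-token NLL under sliding-window evaluation.

    score_fn(chunk) is assumed to return one NLL per position in chunk.
    Windows overlap by (window - stride) tokens; each position is scored
    exactly once, in the first window that reaches it, so tokens past the
    first window always see (window - stride) tokens of left context.
    """
    total, count, scored_upto = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        losses = score_fn(tokens[begin:end])
        # Skip positions already scored by an earlier window.
        for pos in range(max(begin, scored_upto), end):
            total += losses[pos - begin]
            count += 1
        scored_upto = end
        if end == len(tokens):
            break
    return total / count
```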
n-gram cache
parameters: {"smoothing":"Kneser-Ney","order":7,"normalization":"exact trie"}
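The entry's n-gram cache uses order-7 Kneser-Ney over an exact trie; as a hedged illustration only, here is interpolated Kneser-Ney at order 2 (bigrams), with a fixed absolute discount as an assumed parameter:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram probabilities (order-2 sketch;
    the leaderboard entry uses order 7 over an exact trie)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter(tokens[:-1])
    followers = defaultdict(set)   # distinct words following each context
    preceders = defaultdict(set)   # distinct contexts preceding each word
    for v, w in bigrams:
        followers[v].add(w)
        preceders[w].add(v)
    total_types = len(bigrams)     # number of distinct bigram types

    def prob(v, w):
        # Continuation probability: fraction of bigram types ending in w.
        p_cont = len(preceders[w]) / total_types
        c_v = context_count[v]
        if c_v == 0:
            return p_cont          # back off fully for unseen contexts
        discounted = max(bigrams[(v, w)] - discount, 0.0) / c_v
        backoff_mass = discount * len(followers[v]) / c_v
        return discounted + backoff_mass * p_cont

    return prob
```

With the absolute-discount interpolation above, the probabilities over the observed vocabulary sum to one for any seen context, which is the "exact normalization" property the entry emphasizes.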
Other
other
Online logit bias via per-token SGD on a logit bias vector
parameters: null
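The online logit-bias technique above can be sketched as follows; the learning rate and the plain cross-entropy objective are assumptions, since the entry reports no parameters:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class OnlineLogitBias:
    """Per-token SGD on an additive logit-bias vector (illustrative
    sketch; the update rule used in the PR is not specified)."""
    def __init__(self, vocab_size, lr=0.1):
        self.bias = [0.0] * vocab_size
        self.lr = lr

    def adjust(self, logits):
        # The bias is added to the model's raw logits before evaluation.
        return [l + b for l, b in zip(logits, self.bias)]

    def update(self, logits, target):
        # d(cross-entropy)/d(bias) = softmax(logits + bias) - one_hot(target)
        probs = softmax(self.adjust(logits))
        for i, p in enumerate(probs):
            grad = p - (1.0 if i == target else 0.0)
            self.bias[i] -= self.lr * grad
```

Note that the extra softmax and SGD step per generated token is exactly the kind of overhead that can blow an evaluation compute budget, as the contributions list reports.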
other
Complementary training that down-weights n-gram-predictable tokens in the training loss
parameters: null
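One plausible form of the complementary-training scheme above is a per-token loss weight that shrinks as an n-gram model's probability of the token grows; the weighting function here is an assumption, not the PR's actual rule:

```python
def complementary_weights(ngram_probs, alpha=1.0):
    """Weight for each training token: low when an n-gram model already
    assigns it high probability, high otherwise, so gradient signal
    concentrates on tokens the cache cannot predict (hypothetical form)."""
    return [(1.0 - p) ** alpha for p in ngram_probs]

def weighted_mean_loss(token_nlls, weights):
    # Weighted average of per-token negative log-likelihoods.
    return sum(w * l for w, l in zip(weights, token_nlls)) / sum(weights)
```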
other
SLOT-based method with causal dependence violation concerns
parameters: null
other
Scylla tokenizer with corrected byte accounting
parameters: null
Architecture
MLP adapters
Zero-init rank-64 prime MLP adapters
parameters: {"rank":64,"init":"zero"}
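The zero-init rank-64 adapter above can be sketched as a residual bottleneck MLP whose up-projection starts at zero, so the adapted network is initially identical to the base model; the "prime" placement scheme in the entry is not reproduced here:

```python
import random

class ZeroInitAdapter:
    """Rank-r residual MLP adapter with a zero-initialized up-projection,
    making the adapter an exact identity at initialization (sketch)."""
    def __init__(self, d_model, rank=64, seed=0):
        rng = random.Random(seed)
        scale = d_model ** -0.5
        self.down = [[rng.gauss(0.0, scale) for _ in range(rank)]
                     for _ in range(d_model)]             # d_model x rank
        self.up = [[0.0] * d_model for _ in range(rank)]  # rank x d_model

    def __call__(self, x):
        # h = relu(x @ down); out = x + h @ up (residual connection)
        rank = len(self.up)
        h = [max(0.0, sum(x[i] * self.down[i][r] for i in range(len(x))))
             for r in range(rank)]
        return [x[j] + sum(h[r] * self.up[r][j] for r in range(rank))
                for j in range(len(x))]
```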
XSA
XSA applied on all layers
parameters: {"layers":"all"}
Test-Time Training
score-first TTT
parameters: null

Novel Contributions

  • Comprehensive negative-results report on techniques that do not improve strong GPTQ'd models
  • Proof that properly normalized n-gram caches provide only negligible gains on strong models
  • Demonstration that online logit bias both hurts quality and exceeds the evaluation compute budget
  • Evidence that prime MLP adapters, complementary training, and score-first chunked TTT are ineffective on the reported baseline
  • Claim that prior Scylla sub-1.0 BPB results were due to byte-accounting bugs
  • Identification of dominant factors as training data volume, Full Hessian GPTQ, coprime-stride data loading, and XSA
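The coprime-stride data loading named in the last bullet can be illustrated with a short sketch: stepping through N data chunks with a stride coprime to N yields a permutation that visits every chunk exactly once per cycle (the chunk granularity here is a hypothetical example):

```python
from math import gcd

def coprime_stride_order(n_chunks, stride):
    """Indices visited when stepping by `stride` mod n_chunks.
    gcd(stride, n_chunks) == 1 guarantees a full permutation:
    every chunk is read exactly once before any repeats."""
    if gcd(n_chunks, stride) != 1:
        raise ValueError("stride must be coprime to n_chunks")
    return [(i * stride) % n_chunks for i in range(n_chunks)]
```

This gives a cheap, deterministic shuffle-like traversal without storing a permutation, which is presumably why it is listed among the dominant factors.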