PR #1186

open

Non-record: Negative Results — Architecture, TTT Variants, Quantization, and N-gram Cache Illegality

by andrewbaggio1

val_bpb

0.9850

Architecture

Transformer

Optimizer

AdamW

Artifact Size

15.62 MB

Training Techniques

Test-Time Training

full TTT

parameters: {"epochs":20,"learning_rate":null}

LR Schedule

cosine decay

parameters: null
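
For reference, a cosine decay schedule can be sketched as follows. This is a minimal illustration, not the PR's code: the PR records no schedule parameters and lists `learning_rate` as null, so `base_lr`, `min_lr`, and the step counts here are hypothetical.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical usage: one learning rate per TTT epoch over 20 epochs.
schedule = [cosine_lr(epoch, 20, base_lr=1e-3) for epoch in range(20)]
```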

Evaluation

sliding window eval

parameters: {"stride":64,"seq_len":2048}
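
The sliding-window evaluation with these parameters can be sketched as a window generator. This is an assumption about how the windows are laid out (the first window scores all of its tokens, every later window scores only its newest `stride` tokens); the PR itself gives only `stride` and `seq_len`, and the function name is hypothetical.

```python
def sliding_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Yield (start, end, score_from) triples.

    The model is fed tokens[start:end]; only positions score_from..end-1
    contribute to the loss, so each token is scored exactly once with as
    much left context as the window allows.
    """
    scored_upto = 0
    while scored_upto < n_tokens:
        # First window covers a full seq_len; later windows advance by stride.
        step = seq_len if scored_upto == 0 else stride
        end = min(scored_upto + step, n_tokens)
        start = max(0, end - seq_len)
        yield start, end, scored_upto
        scored_upto = end
```

With `seq_len=2048` and `stride=64`, every token after the first window is predicted with at least 1984 tokens of context, at the cost of roughly 32 forward passes per 2048 tokens.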

Architecture

LeakyReLU

Base architecture uses the LeakyReLU(0.5)^2 stack

parameters: {"slope":0.5}
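
Reading `LeakyReLU(0.5)^2` as "leaky ReLU with negative slope 0.5, then squared" (an interpretation, since the PR does not spell the activation out), a scalar sketch looks like this:

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """Square of a leaky ReLU: x**2 for x >= 0, (slope * x)**2 for x < 0."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Under this reading the output is always non-negative, and negative inputs are attenuated by `slope**2 = 0.25` relative to positive inputs of the same magnitude.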

BigramHash

Hashed n-gram cache / count-sketch interpolation

parameters: {"order":2}

TrigramHash

Evaluated trigram hash variant

parameters: {"order":3}

depth recurrence

Huginn-style depth recurrence

parameters: null

MLP3x

MLP width multiplier variant

parameters: {"multiplier":3.25}

XSA

XSA-all variant

parameters: {"layers":11}

Optimizer

SGD

weight_decay: null

momentum: 0.9

other_params: null

Quantization

int6

bits: 6

scope: all

int5

bits: 5

scope: all
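
To make the int6 versus int5 trade-off concrete, here is a minimal sketch of symmetric round-to-nearest quantization at a given bit width. The scheme is an assumption: the PR states only the bit widths and `scope: all`, not whether scales are per-tensor or per-row, or whether the quantizer is symmetric.

```python
def quantize_symmetric(weights, bits):
    """Quantize a list of floats to signed integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 15 for int5
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to nearest, then clamp to guard against floating-point overshoot.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]
```

Dropping from 6 to 5 bits roughly doubles the step size `scale`, and with it the worst-case rounding error per weight.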

Other

other

Entropy-adaptive n-gram cache interpolation with multi-order backoff (2,3,4,5-gram) using score-first cache updates

parameters: {"orders":[2,3,4,5]}
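
The idea can be sketched as follows, under several assumptions not stated in the PR: the cache keeps exact (unhashed) counts, backoff picks the longest order whose context has been seen, and the cache weight grows linearly with the model's normalized entropy up to a hypothetical `max_weight`. All class and function names are illustrative.

```python
import math
from collections import defaultdict

class NgramCache:
    """Exact-count n-gram cache with backoff from the longest order down."""

    def __init__(self, orders=(2, 3, 4, 5)):
        self.orders = tuple(sorted(orders, reverse=True))
        # counts[order][context tuple][next token] -> count
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def predict(self, history, vocab_size):
        """Distribution from the longest previously seen context, else None."""
        for n in self.orders:
            if len(history) < n - 1:
                continue
            ctx = tuple(history[len(history) - (n - 1):])
            seen = self.counts[n].get(ctx)
            if seen:
                total = sum(seen.values())
                return [seen.get(t, 0) / total for t in range(vocab_size)]
        return None

    def update(self, history, token):
        for n in self.orders:
            if len(history) >= n - 1:
                ctx = tuple(history[len(history) - (n - 1):])
                self.counts[n][ctx][token] += 1

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def mix(model_probs, cache_probs, max_weight=0.5):
    """Lean on the cache more when the model is uncertain (vocab size >= 2)."""
    if cache_probs is None:
        return list(model_probs)
    # Normalized entropy: 0 for a one-hot model, 1 for a uniform model.
    h = min(1.0, entropy(model_probs) / math.log(len(model_probs)))
    lam = max_weight * h
    return [(1 - lam) * m + lam * c for m, c in zip(model_probs, cache_probs)]

def score_stream(tokens, model_fn, vocab_size):
    """Total negative log-likelihood with score-first cache updates."""
    cache, total_nll = NgramCache(), 0.0
    for i, tok in enumerate(tokens):
        history = tokens[:i]
        p = mix(model_fn(history), cache.predict(history, vocab_size))
        total_nll -= math.log(p[tok])   # score the token first ...
        cache.update(history, tok)      # ... and only then add it to the cache
    return total_nll
```

Because the mixture is a convex combination of two normalized distributions, it still sums to 1; updating only after scoring keeps the target token from leaking into its own prediction.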

Novel Contributions

Combined cosine-schedule TTT with a multi-order n-gram cache at inference time
Used entropy-adaptive mixing between model probabilities and n-gram probabilities
Documented negative results for architecture changes, TTT variants, and quantization
Showed that hashed n-gram caches can be fundamentally broken by collision-driven probability inflation
Compared several architecture variants including depth recurrence, TrigramHash, MLP width scaling, and XSA-all
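
One plausible way collision-driven inflation can arise is sketched below with a toy hash; the PR's own analysis may differ, and the hash function, table layout, and names here are hypothetical. When (context, token) pairs share buckets, tokens that never followed the context inherit counts from tokens that did, so the per-token estimates no longer sum to 1.

```python
def hashed_probs(pairs, ctx, vocab_size, buckets):
    """Estimate P(token | ctx) from hashed count tables (toy hash)."""
    pair_counts = [0] * buckets
    ctx_counts = [0] * buckets
    for c, t in pairs:
        pair_counts[(c * 31 + t) % buckets] += 1
        ctx_counts[c % buckets] += 1
    denom = ctx_counts[ctx % buckets]
    if denom == 0:
        return [0.0] * vocab_size
    return [pair_counts[(ctx * 31 + t) % buckets] / denom for t in range(vocab_size)]

# Context 0 is followed by token 1 three times and token 2 once.
observed = [(0, 1)] * 3 + [(0, 2)]
small = hashed_probs(observed, ctx=0, vocab_size=8, buckets=4)     # heavy collisions
large = hashed_probs(observed, ctx=0, vocab_size=8, buckets=1024)  # no collisions here
```

With 4 buckets, tokens 5 and 6 collide with tokens 1 and 2 and receive the same estimates, so `sum(small)` is 2.0 while `sum(large)` is 1.0. An evaluation that reads off only the target token's estimate would be credited with probability mass that a normalized distribution could not assign.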