PR #1861

closed

SkipQuant Adapter TTT + Causal PPM-D Byte Mixture (~1.1876 BPB 1M slice, ~0.8997 est)

by Hetul803
val_bpb
1.1877
Architecture
Transformer
Optimizer
Artifact Size
15,993,558 bytes

Training Techniques

Quantization
SkipQuant
bits: null
scope: model weights
Test-Time Training
score-first TTT
parameters: {"epochs":2,"chunk":8192}
Architecture
adapter
Eval-time adapter memory module with a fixed random feature projection A and a zero-initialized trainable projection B; B is updated only after each chunk has been scored.
parameters: {"rank":1024}
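The score-first adapter update can be sketched as follows. This is a minimal toy illustration, not the PR's implementation: the base model is stood in by a single linear head `W_base`, sizes are shrunk from the PR's rank-1024 configuration, and the learning rate is a made-up value. The key properties from the description are preserved: A is a frozen random projection, B starts at zero (so the adapter is initially a no-op), and each chunk is scored before B is trained on it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, vocab = 32, 8, 16          # toy sizes; the PR uses rank=1024

W_base = rng.standard_normal((d_model, vocab)) * 0.1         # stand-in base head
A = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)  # frozen random proj.
B = np.zeros((rank, vocab))         # zero-init: adapter starts as a no-op

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def score_then_update(chunks, lr=0.5, epochs=2):
    """Causal score-first TTT: each chunk is scored with the current B,
    and only afterwards is B trained on that chunk, so no chunk is ever
    scored with parameters that have seen its own bytes."""
    global B
    losses = []
    for h, y in chunks:
        p = softmax(h @ W_base + (h @ A) @ B)          # 1) score first
        losses.append(-np.mean(np.log(p[np.arange(len(y)), y])))
        for _ in range(epochs):                        # 2) then update B
            g = softmax(h @ W_base + (h @ A) @ B)
            g[np.arange(len(y)), y] -= 1.0             # softmax CE gradient
            B -= lr * (h @ A).T @ g / len(y)
    return losses
```

Because B is zero-initialized, the first chunk's loss is exactly the base model's loss; adaptation only affects later chunks.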
Other
other
Byte-level PPM-D causal mixture combined with neural probabilities using a confidence-gated convex interpolation.
parameters: {"order":5,"lambda_low":0.9,"lambda_high":0.05,"threshold":0.78}
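The confidence-gated convex interpolation could look like the sketch below. This is one plausible reading of the listed parameters, not confirmed by the PR: it assumes lambda is the weight on the PPM distribution, gated by whether the neural model's top byte probability clears the 0.78 threshold, so PPM dominates (0.9) when the neural model is unsure and contributes only lightly (0.05) when it is confident.

```python
import numpy as np

def gated_mixture(p_neural, p_ppm, threshold=0.78,
                  lambda_low=0.9, lambda_high=0.05):
    """Confidence-gated convex mixture of neural and PPM byte
    distributions (assumed reading: lambda is the PPM weight,
    selected per position by the neural model's confidence)."""
    conf = p_neural.max(axis=-1, keepdims=True)   # neural top-1 probability
    lam = np.where(conf >= threshold, lambda_high, lambda_low)
    return (1.0 - lam) * p_neural + lam * p_ppm   # convex: rows still sum to 1
```

Since both inputs are probability distributions and the weights sum to one, the output remains a valid distribution over the 256 byte values.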
Sequence Length
sequence_length
train_length: 8192
eval_length: 1048576

Novel Contributions

  • Strict score-before-update causal evaluation for both TTT and PPM
  • Eval-time adapter memory TTT with rank-1024 projection
  • Byte-level PPM-D mixture on top of a quantized SkipQuant TTT stack
  • Confidence-gated convex mixture between neural and PPM probabilities
  • Token-to-byte probability distribution for byte-level BPB accounting
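The last bullet's byte-level accounting follows from the chain rule: the model's probability of the byte string is the product of its token probabilities, so the total code length in bits can be divided by the slice's byte count. A minimal sketch (the function name and its inputs are illustrative, not from the PR):

```python
import math

def bits_per_byte(token_logprobs, token_byte_lengths):
    """BPB from token-level log-probs (natural log). Total bits are
    -sum(log2 p(token)); dividing by the UTF-8 byte count converts the
    token-level score into byte-level bits per byte."""
    total_bits = -sum(token_logprobs) / math.log(2.0)  # nats -> bits
    return total_bits / sum(token_byte_lengths)
```

For example, four one-byte tokens each assigned probability 0.5 cost one bit apiece, giving exactly 1.0 BPB.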