PR #1861

closed

SkipQuant Adapter TTT + Causal PPM-D Byte Mixture (~1.1876 BPB 1M slice, ~0.8997 est)

by Hetul803
val_bpb
1.1877
Architecture
Transformer
Optimizer
Artifact Size
15,993,558 bytes

Training Techniques

Quantization
SkipQuant
bits: null
scope: model weights
Test-Time Training
score-first TTT
parameters: {"epochs":2,"chunk":8192}
Architecture
adapter
Eval-time adapter memory module with a fixed random feature projection A and a zero-initialized trainable projection B; B is updated only after each chunk has been scored.
parameters: {"rank":1024}
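The score-first adapter update can be sketched as follows. This is a minimal toy illustration, not the PR's implementation: the base model is stood in by a single linear head `W_base`, sizes are shrunk from the PR's rank-1024 configuration, and the learning rate is a made-up value. The key properties from the description are preserved: A is a frozen random projection, B starts at zero (so the adapter is initially a no-op), and each chunk is scored before B is trained on it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, vocab = 32, 8, 16          # toy sizes; the PR uses rank=1024

W_base = rng.standard_normal((d_model, vocab)) * 0.1         # stand-in base head
A = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)  # frozen random proj.
B = np.zeros((rank, vocab))         # zero-init: adapter starts as a no-op

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def score_then_update(chunks, lr=0.5, epochs=2):
    """Causal score-first TTT: each chunk is scored with the current B,
    and only afterwards is B trained on that chunk, so no chunk is ever
    scored with parameters that have seen its own bytes."""
    global B
    losses = []
    for h, y in chunks:
        p = softmax(h @ W_base + (h @ A) @ B)          # 1) score first
        losses.append(-np.mean(np.log(p[np.arange(len(y)), y])))
        for _ in range(epochs):                        # 2) then update B
            g = softmax(h @ W_base + (h @ A) @ B)
            g[np.arange(len(y)), y] -= 1.0             # softmax CE gradient
            B -= lr * (h @ A).T @ g / len(y)
    return losses
```

Because B is zero-initialized, the first chunk's loss is exactly the base model's loss; adaptation only affects later chunks.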
Other
other
Byte-level PPM-D causal mixture combined with neural probabilities using a confidence-gated convex interpolation.
parameters: {"order":5,"lambda_low":0.9,"lambda_high":0.05,"threshold":0.78}
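The confidence-gated convex interpolation could look like the sketch below. This is one plausible reading of the listed parameters, not confirmed by the PR: it assumes lambda is the weight on the PPM distribution, gated by whether the neural model's top byte probability clears the 0.78 threshold, so PPM dominates (0.9) when the neural model is unsure and contributes only lightly (0.05) when it is confident.

```python
import numpy as np

def gated_mixture(p_neural, p_ppm, threshold=0.78,
                  lambda_low=0.9, lambda_high=0.05):
    """Confidence-gated convex mixture of neural and PPM byte
    distributions (assumed reading: lambda is the PPM weight,
    selected per position by the neural model's confidence)."""
    conf = p_neural.max(axis=-1, keepdims=True)   # neural top-1 probability
    lam = np.where(conf >= threshold, lambda_high, lambda_low)
    return (1.0 - lam) * p_neural + lam * p_ppm   # convex: rows still sum to 1
```

Since both inputs are probability distributions and the weights sum to one, the output remains a valid distribution over the 256 byte values.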
Sequence Length
sequence_length
train_length: 8192
eval_length: 1048576

Novel Contributions

  • Strict score-before-update causal evaluation for both TTT and PPM
  • Eval-time adapter memory TTT with rank-1024 projection
  • Byte-level PPM-D mixture on top of a quantized SkipQuant TTT stack
  • Confidence-gated convex mixture between neural and PPM probabilities
  • Token-to-byte probability distribution for byte-level BPB accounting
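The last bullet's byte-level accounting follows from the chain rule: the model's probability of the byte string is the product of its token probabilities, so the total code length in bits can be divided by the slice's byte count. A minimal sketch (the function name and its inputs are illustrative, not from the PR):

```python
import math

def bits_per_byte(token_logprobs, token_byte_lengths):
    """BPB from token-level log-probs (natural log). Total bits are
    -sum(log2 p(token)); dividing by the UTF-8 byte count converts the
    token-level score into byte-level bits per byte."""
    total_bits = -sum(token_logprobs) / math.log(2.0)  # nats -> bits
    return total_bits / sum(token_byte_lengths)
```

For example, four one-byte tokens each assigned probability 0.5 cost one bit apiece, giving exactly 1.0 BPB.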