PR #1416
openRecord: SP8192 + Pre-Quant TTT — val_bpb 1.07948 (3-seed mean)
by erichroepke
val_bpb: 1.0795
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.12 MB
Training Techniques
Quantization
GPTQ (bits: null; scope: embeddings and weights)
GPTQ (bits: null; scope: embeddings)
Architecture
Depth recurrence: loops layers 4-5 twice to increase effective depth without increasing parameter count (parameters: null).
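A minimal sketch of that schedule, assuming a straightforward replay of the looped span (the function names are hypothetical; the PR only states that layers 4-5 run twice):

```python
def execution_order(n_layers, loop_start, loop_end, loops):
    """Return layer indices in the order they run under depth recurrence.

    Layers loop_start..loop_end (inclusive) are applied `loops` times in
    total; every other layer runs once. No parameters are added, only reuse.
    """
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == loop_end:  # replay the looped span (loops - 1) extra times
            order.extend(list(range(loop_start, loop_end + 1)) * (loops - 1))
    return order


def run_stack(x, layers, loop_start, loop_end, loops=2):
    """Apply `layers` (a list of callables) in the recurrent order."""
    for i in execution_order(len(layers), loop_start, loop_end, loops):
        x = layers[i](x)
    return x
```

For a 6-layer stack looping layers 4-5 twice, the execution order is `[0, 1, 2, 3, 4, 5, 4, 5]`: eight layer applications from six layers' worth of parameters.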
XSA: removes self-attention redundancy via projection across all layers (parameters: {"layers": "all"}).
U-Net skip connections: learned gating on the skip connections (parameters: null).
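The entry above only says the skips are gated; one plausible minimal form is a sigmoid-gated additive skip, with `gate_logit` standing in for a learned scalar (the exact gating form is an assumption, not stated in the PR):

```python
import numpy as np

def gated_skip(decoder_h, encoder_h, gate_logit):
    """Sketch of a learned-gated U-Net skip connection (assumed form).

    Instead of plain addition, the skip contribution is scaled by a learned
    scalar passed through a sigmoid; gate_logit would be trainable.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid, in [0, 1]
    return decoder_h + gate * encoder_h
```

At `gate_logit = 0` the skip contributes at half strength; training can push the gate toward 0 or 1 per connection.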
Optimizer
Muon (weight_decay: null; momentum: null; variant: MuonEq-R)
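The MuonEq-R variant's details are not given in the PR; as background, a hedged sketch of the base Muon update, which orthogonalizes the momentum buffer via a Newton-Schulz iteration (the simple cubic iteration here, rather than Muon's tuned quintic):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=10):
    """Approximately orthogonalize a matrix (the core of a Muon-style update).

    Cubic Newton-Schulz iteration; real Muon uses a tuned quintic, and the
    MuonEq-R variant named in the PR is not specified here.
    """
    x = g / (np.linalg.norm(g) + 1e-7)  # scale so singular values are <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # drives singular values toward 1
    return x

def muon_step(w, momentum, grad, lr=0.02, beta=0.95):
    """One Muon-style step: momentum accumulation, then orthogonalized update."""
    momentum = beta * momentum + grad
    w = w - lr * newton_schulz_orthogonalize(momentum)
    return w, momentum
```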
Weight Averaging
EMA (decay: 0.997)
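The EMA rule itself is standard; with the PR's decay of 0.997, each step keeps 99.7% of the running average:

```python
def ema_update(avg_params, params, decay=0.997):
    """Exponential moving average of weights (decay from the PR: 0.997).

    avg <- decay * avg + (1 - decay) * current, applied after each training
    step; the averaged copy is what gets evaluated.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]
```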
Test-Time Training
Full TTT (optimizer: AdamW; epochs: 6; timing: pre-quant)
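The ordering is the point: adapt in full precision first, quantize after. An illustrative sketch, where plain gradient steps stand in for AdamW and uniform rounding stands in for the GPTQ stage (only the 6-epoch pre-quant ordering comes from the PR):

```python
import numpy as np

def ttt_then_quantize(w, grad_fn, lr=1e-3, epochs=6, n_levels=256):
    """Pre-quant TTT ordering: adapt full-precision weights, then quantize.

    grad_fn(w) returns the gradient of the test-time loss; the quantizer
    here is simple symmetric uniform rounding for illustration only.
    """
    for _ in range(epochs):  # test-time training in full precision
        w = w - lr * grad_fn(w)
    # uniform quantization stands in for the GPTQ stage
    scale = np.abs(w).max() / (n_levels // 2 - 1)
    return np.round(w / scale) * scale
```

Quantizing first and adapting after would leave the TTT updates fighting the rounding grid; adapting first lets the quantizer snapshot the already-adapted weights.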
Other
SDClip: quantization clipping using threshold = k × std(row) instead of grid search (parameters: null).
SP8192 tokenizer / vocabulary (vocab_size: 8192).
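The SDClip rule above replaces a per-row grid search over clip values with a closed-form range; a minimal sketch (the default `k` is hypothetical, since the PR does not state its value):

```python
import numpy as np

def sdclip(weight_rows, k=3.0):
    """Clip each weight row to +/- k * std(row) before quantization.

    One statistic per row replaces the grid search over candidate clip
    thresholds; k is a hypothetical default.
    """
    thresholds = k * weight_rows.std(axis=1, keepdims=True)
    return np.clip(weight_rows, -thresholds, thresholds)
```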
Novel Contributions
- Combined SP8192 base architecture with pre-quant AdamW TTT.
- Showed that pre-quant TTT and the SP8192 + SDClip + GPTQ pipeline stack without interfering.
- Achieved a new record val_bpb of 1.07948 using a 3-seed mean.
- Applied TTT before quantization so the adapted full-precision weights compress cleanly.