PR #850
open
Record: 0.3212 BPB — Complementary N-gram 65K + Int5 GPTQ + LoRA TTT
by callithyia
val_bpb
0.3212
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~14.9 MB
Training Techniques
Architecture
BigramHash
Hashed bigram embedding with 4096 buckets.
parameters: {"buckets":4096}
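A minimal sketch of a hashed bigram embedding. The bucket count comes from the parameters; the hash mix, embedding width, and how the feature is combined with the token embedding are assumptions.

```python
import numpy as np

BUCKETS = 4096  # from parameters: {"buckets": 4096}
DIM = 64        # embedding width is an assumption; the PR does not state it

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Hypothetical hash: mix the two token ids, then reduce mod the bucket count.
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h >> 16) % BUCKETS

rng = np.random.default_rng(0)
bigram_emb = rng.normal(0.0, 0.02, size=(BUCKETS, DIM))

def bigram_features(tokens: list[int]) -> np.ndarray:
    # Position 0 has no predecessor; a zero vector is used there.
    out = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):
        out[i] = bigram_emb[bigram_bucket(tokens[i - 1], tokens[i])]
    return out

feats = bigram_features([5, 17, 17, 99])
```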
MLP3x
MLP with 3.0x expansion (hidden size 1536) and squared LeakyReLU activation (negative slope 0.9).
parameters: {"expansion":3,"hidden":1536}
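A sketch of the MLP block under the stated parameters (hidden 1536 with 3x expansion implies model width 512). Whether the sign is preserved after squaring is not stated; plain squaring is assumed here.

```python
import numpy as np

D_MODEL = 512          # implied by hidden=1536 with 3x expansion
HIDDEN = 3 * D_MODEL   # 1536, matching parameters

def act(x: np.ndarray) -> np.ndarray:
    # Squared LeakyReLU with negative slope 0.9, per the description.
    lrelu = np.where(x > 0, x, 0.9 * x)
    return lrelu ** 2

rng = np.random.default_rng(0)
w_in = rng.normal(0.0, D_MODEL ** -0.5, (D_MODEL, HIDDEN))
w_out = rng.normal(0.0, HIDDEN ** -0.5, (HIDDEN, D_MODEL))

def mlp3x(x: np.ndarray) -> np.ndarray:
    # Biases omitted for brevity; their presence is not stated in the PR.
    return act(x @ w_in) @ w_out

y = mlp3x(rng.normal(size=(4, D_MODEL)))
```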
XSA
XSA applied on the last 4 layers.
parameters: {"layers":4}
Value Residual Learning
Value Residual Learning applied across layers 1-10.
parameters: {"layers":[1,10]}
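A minimal sketch of value residual learning: each layer's attention values are mixed with the first layer's values. A fixed scalar mixing weight is used for illustration; in practice it is typically a learnable per-layer parameter.

```python
import numpy as np

def value_residual(v_layer: np.ndarray, v_first: np.ndarray, lam: float) -> np.ndarray:
    # Mix the current layer's value projection with layer 1's value projection.
    return lam * v_layer + (1.0 - lam) * v_first

v_first = np.ones((2, 3))   # values from the first layer
v_layer = np.zeros((2, 3))  # values at a later layer
mixed = value_residual(v_layer, v_first, lam=0.7)
```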
Gated Attention
Gated Attention with bias 4.0 on all layers.
parameters: {"bias":4}
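A sketch of a sigmoid output gate on attention, assuming an elementwise gate computed from the layer input; with the bias initialized to 4.0, the gate starts near sigmoid(4) ≈ 0.98, i.e. almost fully open. The zero weight init and gate placement are assumptions.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

D = 8
w_gate = np.zeros((D, D))   # zero-init gate weights: an assumption
b_gate = np.full(D, 4.0)    # bias 4.0, from parameters

def gated_attention_output(attn_out: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Elementwise sigmoid gate on the attention output, conditioned on the input.
    g = sigmoid(x @ w_gate + b_gate)
    return g * attn_out

y = gated_attention_output(np.ones((2, D)), np.zeros((2, D)))
```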
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz":5,"per_group_banking":true,"encoder_lr":0.025,"decoder_lr":0.05}
Weight Averaging
Polyak averaging
parameters: {"decay":0.998}
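Polyak averaging keeps an exponential moving average of the weights and evaluates with the averaged copy. A minimal sketch with the stated decay:

```python
def polyak_update(avg: dict, params: dict, decay: float = 0.998) -> dict:
    # avg <- decay * avg + (1 - decay) * params, per tensor.
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}

avg = {"w": 0.0}
for _ in range(3):
    # After k steps toward a constant target of 1.0, avg = 1 - decay**k.
    avg = polyak_update(avg, {"w": 1.0})
```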
Compression
lzma
level: 9
Evaluation
order-9 n-gram backoff cache
parameters: {"orders":[2,9],"chunk_size":65536,"cache_buckets":4000000,"entropy_adaptive_alpha_blending":true}
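A simplified sketch of a hashed n-gram backoff cache over orders 2–9 (the PR refreshes it every 65,536-token chunk). The hash, the backoff rule, and the fixed blending alpha are illustrative assumptions; the PR adapts alpha to entropy.

```python
from collections import defaultdict

ORDERS = range(2, 10)       # backoff orders 2..9, per parameters
CACHE_BUCKETS = 4_000_000   # cache_buckets, per parameters

# counts[n][hashed (n-1)-token context][next token] -> count
counts = {n: defaultdict(lambda: defaultdict(int)) for n in ORDERS}

def ctx_key(tokens: list[int], n: int) -> int:
    # Hash the (n-1)-token context into a fixed bucket space.
    return hash(tuple(tokens[-(n - 1):])) % CACHE_BUCKETS

def update(tokens: list[int], nxt: int) -> None:
    for n in ORDERS:
        if len(tokens) >= n - 1:
            counts[n][ctx_key(tokens, n)][nxt] += 1

def ngram_prob(tokens: list[int], nxt: int):
    # Back off from order 9 down to order 2: use the highest order with counts.
    for n in reversed(ORDERS):
        if len(tokens) >= n - 1:
            bucket = counts[n].get(ctx_key(tokens, n))
            if bucket:
                return bucket.get(nxt, 0) / sum(bucket.values())
    return None  # no context seen at any order

def blend(p_model: float, p_ngram: float, alpha: float) -> float:
    # The PR computes alpha adaptively from entropy; fixed here for illustration.
    return alpha * p_ngram + (1.0 - alpha) * p_model

update([1, 2, 3], 4)
p = ngram_prob([1, 2, 3], 4)
```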
Test-Time Training
LoRA TTT
parameters: {"rank":8,"qv_blocks":[9,10],"learning_rate":0.003,"polyak_decay":0.998,"score_first":true}
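A sketch of LoRA test-time training: a frozen weight plus a rank-8 adapter, updated online while evaluating. The stand-in loss, dimensions, and zero-init of one adapter factor are assumptions; the "score-first" protocol (score the chunk before the update) is from the PR.

```python
import numpy as np

RANK, LR, D = 8, 0.003, 32   # rank and learning_rate from parameters; D assumed
rng = np.random.default_rng(0)

# Frozen base weight plus a low-rank adapter: W_eff = W + B @ A.
W = rng.normal(0.0, D ** -0.5, (D, D))
A = rng.normal(0.0, 0.01, (RANK, D))
B = np.zeros((D, RANK))      # zero-init B so the adapter starts as a no-op

def forward(x: np.ndarray) -> np.ndarray:
    return x @ (W + B @ A).T

def ttt_step(chunk: np.ndarray) -> float:
    global A, B
    # "Score-first": the chunk is scored before the gradient step, so no
    # token is scored with weights that have already been updated on it.
    y = forward(chunk)
    score = float(np.mean(y ** 2))     # stand-in loss; the PR uses the LM loss
    gW = 2.0 * y.T @ chunk / y.size    # dL/dW_eff for the stand-in loss
    gB, gA = gW @ A.T, B.T @ gW        # chain rule through W_eff = W + B @ A
    B -= LR * gB
    A -= LR * gA
    # The PR also Polyak-averages the adapters (decay 0.998), omitted here.
    return score

s = ttt_step(rng.normal(size=(4, D)))
```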
LR Schedule
WSD
parameters: {"stable_fraction":0.75,"decay":"cosine"}
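A sketch of a warmup-stable-decay (WSD) schedule. The stable fraction (0.75) and cosine decay tail come from the parameters; the warmup fraction is an assumption.

```python
import math

def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.05, stable_frac: float = 0.75) -> float:
    warmup = int(warmup_frac * total_steps)
    stable_end = int((warmup_frac + stable_frac) * total_steps)
    if step < warmup:
        # Linear warmup to peak_lr.
        return peak_lr * (step + 1) / warmup
    if step < stable_end:
        # Hold at peak_lr for the stable fraction of training.
        return peak_lr
    # Cosine decay from peak_lr toward 0 over the remaining steps.
    t = (step - stable_end) / max(1, total_steps - stable_end)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```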
Quantization
GPTQ
bits: 5
scope: all
Regularization
EMA
parameters: {"decay":0.997}
Other
other
Complementary training that downweights tokens the bigram model already predicts well.
parameters: {"alpha":0.5}
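A sketch of complementary loss weighting: tokens the bigram side already predicts well contribute less to the training loss, pushing model capacity toward what the n-gram cache cannot cover. The linear weighting function is an assumption; alpha is from the parameters.

```python
import numpy as np

ALPHA = 0.5  # from parameters

def complementary_weights(bigram_prob: np.ndarray, alpha: float = ALPHA) -> np.ndarray:
    # Weight 1.0 for tokens the bigram cannot predict, down to 1 - alpha
    # for tokens it predicts with certainty.
    return 1.0 - alpha * bigram_prob

w = complementary_weights(np.array([0.0, 0.5, 1.0]))
per_token_loss = np.array([2.0, 2.0, 2.0])
weighted_loss = float((w * per_token_loss).mean())
```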
other
Late quantization-aware training with Soft-Round, triggered near the end of training.
parameters: {"trigger_fraction":0.85}
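A sketch of soft rounding as a differentiable surrogate for round(), switched on at the stated trigger fraction. The sharpness value and the exact placement in the forward pass are assumptions.

```python
import numpy as np

def soft_round(x: np.ndarray, t: float = 10.0) -> np.ndarray:
    # Differentiable rounding surrogate: interpolates between the identity
    # (t -> 0) and hard rounding (t -> inf).
    f = np.floor(x)
    r = x - f
    return f + 0.5 * np.tanh(t * (r - 0.5)) / np.tanh(t / 2.0) + 0.5

TRIGGER = 0.85  # trigger_fraction from parameters

def maybe_quantize(w: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    # Late QAT: pass weights through soft_round only after 85% of training.
    if step / total_steps >= TRIGGER:
        return soft_round(w)
    return w
```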
Novel Contributions
- Complementary training combined with an order-9 n-gram cache
- 65K-token chunks for more frequent cache refreshes
- Full Hessian GPTQ int5 with LZMA compression
- LoRA test-time training with Polyak averaging and score-first backward-looking protocol
- Per-order entropy centers and multipliers for n-gram alpha computation