PR #1452
closed
Non-record: TurboQuant + N-gram Hybrid Eval + TTT (1xH100 NVL)
by bsisduck
val_bpb
0.3509
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
14.92 MB
Training Techniques
Quantization
int5
bits: 5
scope: all
late QAT
bits: null
scope: all
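The exact TurboQuant procedure isn't spelled out in this card, but random-rotation int5 quantization can be sketched as follows: apply a random orthogonal rotation to spread outliers across coordinates, then round symmetrically to 5-bit levels. All function names and the QR-based rotation are illustrative assumptions, not the PR's code.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR of a Gaussian (illustrative choice)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the rotation is uniformly distributed
    return q * np.sign(np.diag(r))

def quantize_int5(w, rot):
    """Rotate, then symmetric 5-bit quantization with levels in [-15, 15]."""
    x = w @ rot
    scale = np.abs(x).max() / 15.0
    q = np.clip(np.round(x / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize_int5(q, scale, rot):
    """Undo quantization, then undo the rotation."""
    return (q.astype(np.float32) * scale) @ rot.T

w = np.random.default_rng(1).standard_normal((8, 8)).astype(np.float32)
rot = random_rotation(8)
q, s = quantize_int5(w, rot)
w_hat = dequantize_int5(q, s, rot)
```

The rotation makes per-tensor symmetric scaling much less sensitive to outlier weights, which is the usual motivation for rotation-based quantizers.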
Architecture
Gated Attention
Gated attention used in all layers.
parameters: {"layers":9}
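Gated attention is commonly implemented as an elementwise sigmoid gate, computed from the layer input, multiplying the attention output. The exact gate placement in this PR isn't stated, so the formulation below is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attention_output(x, attn_out, W_gate):
    """Elementwise output gate computed from the layer input x.
    One common gated-attention form; gate placement here is assumed."""
    return sigmoid(x @ W_gate) * attn_out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # (seq, d_model) layer input
attn_out = rng.standard_normal((4, 8))  # raw attention output
W_gate = rng.standard_normal((8, 8)) * 0.1
y = gated_attention_output(x, attn_out, W_gate)
```

Because the gate lies in (0, 1), it can only attenuate the attention output, never amplify it.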
XSA
XSA attention used on the last 4 layers.
parameters: {"layers":4}
BigramHash
Bigram hash component used for hybrid n-gram modeling.
parameters: {"size":1536}
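A bigram hash component typically hashes each (previous token, current token) pair into a fixed-size table of learned features, here of size 1536. The hash multiplier and feature width below are illustrative assumptions.

```python
import numpy as np

TABLE_SIZE = 1536  # matches the "size" parameter above

def bigram_bucket(prev_tok, tok, table_size=TABLE_SIZE):
    """Hash a (prev_token, token) pair into one of table_size buckets.
    The multiplier 1000003 is an arbitrary odd constant (illustrative)."""
    return (prev_tok * 1000003 + tok) % table_size

# Toy usage: look up a learned per-bucket feature for each bigram.
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_SIZE, 16)).astype(np.float32)
tokens = [5, 17, 5, 17]
feats = np.stack([bigram_table[bigram_bucket(a, b)]
                  for a, b in zip(tokens[:-1], tokens[1:])])
```

Hash collisions are tolerated by design; the table trades a little noise for covering the full bigram space in 1536 rows.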
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
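With "16/64" partial RoPE, the rotary embedding rotates only the first 16 of 64 head dimensions; the remaining 48 carry no positional signal. A minimal sketch (the base frequency 10000 and pairing convention are the standard choices, assumed here):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16):
    """Apply rotary position embedding to the first rot_dims of the head
    dimension; leave the remaining dims untouched (16 of 64 here)."""
    d = rot_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(d) / d))
    angles = pos[:, None] * freqs[None, :]          # (seq, d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :d], x[:, d:rot_dims]             # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.random.default_rng(0).standard_normal((4, 64))
pos = np.arange(4)
y = partial_rope(x, pos)
```

The rotation is norm-preserving on the rotated slice and the identity at position 0.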
Value Residual
Value residual connections used across early layers.
parameters: {"layers":"1-8"}
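Value residual connections mix each layer's value vectors with the first layer's values. In the usual formulation the mixing weight is learned per layer; the PR gives no value, so it is just an argument in this sketch:

```python
import numpy as np

def value_residual(v_layer, v_first, lam):
    """Mix layer l's value vectors with layer 1's values (v_first).
    lam is normally a learned per-layer scalar; passed in here."""
    return lam * v_layer + (1.0 - lam) * v_first

v_first = np.ones((3, 4))
v_layer = np.zeros((3, 4))
mixed = value_residual(v_layer, v_first, lam=0.75)
```

Applying this across layers 1-8 (as the card states) gives later layers direct access to the undistorted first-layer values.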
SmearGate
Position-mixing gate used in the model.
parameters: null
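The card gives no parameters for SmearGate, so the sketch below is a guess at the usual "smear" idea: each position's representation is mixed with the previous position's, scaled by a gate. The additive form and scalar gate are assumptions; gates are typically learned.

```python
import numpy as np

def smear(x, gate):
    """Mix each position's embedding with the previous position's.
    Additive form with a scalar gate; details are assumptions."""
    out = x.copy()
    out[1:] = x[1:] + gate * x[:-1]
    return out

x = np.arange(12, dtype=np.float64).reshape(4, 3)  # (seq, dim)
y = smear(x, gate=0.5)
```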
Weight Averaging
EMA
parameters: {"decay":0.997}
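EMA weight averaging with decay 0.997 maintains a shadow copy of the parameters updated each step; evaluation uses the shadow copy, not the raw weights. A minimal sketch:

```python
import numpy as np

def ema_update(avg, params, decay=0.997):
    """avg <- decay * avg + (1 - decay) * params, applied every step."""
    return {k: decay * avg[k] + (1.0 - decay) * params[k] for k in avg}

# Toy run: with constant params, the average approaches them as 1 - decay^t.
params = {"w": np.ones(3)}
avg = {"w": np.zeros(3)}
for _ in range(1000):
    avg = ema_update(avg, params, decay=0.997)
```

With decay 0.997 the effective averaging horizon is roughly 1/(1-0.997) ≈ 333 steps.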
Evaluation
stride-based eval
parameters: {"stride":384}
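Stride-based eval slides a context window over the token stream but scores only the final `stride` (here 384) positions of each window, so every token is evaluated exactly once with maximal left context. A sketch of the window schedule (the context length 1024 is illustrative; the card only gives the stride):

```python
def stride_windows(n_tokens, ctx=1024, stride=384):
    """Return (start, end, score_from) triples: each window re-reads up
    to ctx tokens of context but only scores positions score_from..end,
    so every token is scored exactly once."""
    windows = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - ctx)
        windows.append((start, end, pos))
        pos = end
    return windows
```

A smaller stride gives each scored token more context at the cost of more forward passes.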
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.005,"polyak_decay":0.995,"temperature":1.1}
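LoRA test-time training adapts low-rank adapters on the evaluation context while the base weights stay frozen, with Polyak-averaged adapter copies used for prediction and temperature-scaled outputs. The sketch below uses the card's rank 8, lr 0.005, Polyak decay 0.995, and temperature 1.1 on a toy linear softmax model; the step count, init scale, and model are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lora_ttt(W, x, y, rank=8, lr=0.005, polyak=0.995, steps=20, seed=0):
    """SGD on LoRA adapters (A, B) over the eval context (x, y);
    the frozen base W is never updated. Returns Polyak-averaged adapters."""
    d, V = W.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((rank, V)) * 0.01
    B = np.zeros((d, rank))                  # standard LoRA init: B = 0
    A_avg, B_avg = A.copy(), B.copy()
    onehot = np.eye(V)[y]
    for _ in range(steps):
        logits = x @ (W + B @ A)
        g = (softmax(logits) - onehot) / len(x)  # dLoss/dlogits
        gW = x.T @ g                             # grad w.r.t. W + B @ A
        B -= lr * gW @ A.T
        A -= lr * B.T @ gW
        A_avg = polyak * A_avg + (1 - polyak) * A
        B_avg = polyak * B_avg + (1 - polyak) * B
    return A_avg, B_avg

def predict(W, A, B, x, temperature=1.1):
    """Temperature-scaled probabilities with the averaged adapters."""
    return softmax(x @ (W + B @ A) / temperature)
```

Temperature 1.1 slightly flattens the adapted distribution, which often helps bpb when the adapters overfit the short eval context.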
Compression
lzma
level: null
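The 14.92 MB artifact is lzma-compressed; quantized int5 weights stored one value per byte have only ~5 bits of entropy each, so lzma recovers much of that slack. A minimal round-trip sketch (the preset is illustrative; the card lists the level as null):

```python
import lzma
import numpy as np

# Fake int5 payload: values in [-15, 15], one per int8 byte.
q = np.random.default_rng(0).integers(-15, 16, size=10000, dtype=np.int8)
blob = lzma.compress(q.tobytes(), preset=9)
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```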
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz":true}
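The `newton_schulz` flag refers to Muon's approximate orthogonalization of each gradient matrix via a quintic Newton-Schulz iteration before the update. A sketch using the coefficients published with Muon (3.4445, -4.7750, 2.0315); the step count and epsilon are the usual defaults, assumed here:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (drive singular values toward 1)
    via the quintic Newton-Schulz iteration used in Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = np.random.default_rng(0).standard_normal((8, 8))
O = newton_schulz_orth(G)
sv = np.linalg.svd(O, compute_uv=False)
```

The iteration only needs matmuls, so it runs cheaply on GPU in the optimizer's low-precision dtype; the singular values land near 1 rather than exactly at 1.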
Regularization
complementary training
parameters: {"alpha":0.5}
Other
other
Entropy-adaptive order-9 n-gram backoff cache mixed with neural probabilities during evaluation.
parameters: {"order":9,"min_count":1}
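The description above can be sketched as two pieces: an order-9 backoff cache of n-gram counts built over the eval stream (min_count 1, so any observed context is usable), and an entropy-dependent mixing weight that trusts the n-gram distribution more when the neural model is uncertain. The backoff scheme (stupid backoff to a uniform fallback) and the linear entropy-to-weight mapping are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

class BackoffCache:
    """Order-9 n-gram counts with backoff over the eval stream."""
    def __init__(self, order=9, min_count=1, vocab=256):
        self.order, self.min_count, self.vocab = order, min_count, vocab
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        """Record every context of length 1..order-1 seen so far."""
        for i in range(1, len(tokens)):
            for n in range(1, self.order):
                if i - n < 0:
                    break
                self.counts[tuple(tokens[i - n:i])][tokens[i]] += 1

    def prob(self, context):
        """Back off from the longest matching context to shorter ones."""
        for n in range(self.order - 1, 0, -1):
            c = self.counts.get(tuple(context[-n:]))
            if c and sum(c.values()) >= self.min_count:
                p = np.zeros(self.vocab)
                total = sum(c.values())
                for tok, cnt in c.items():
                    p[tok] = cnt / total
                return p
        return np.full(self.vocab, 1.0 / self.vocab)  # uniform fallback

def entropy_adaptive_mix(p_neural, p_ngram):
    """Weight the n-gram distribution more when the neural model's
    entropy is high. The linear mapping is an illustrative choice."""
    H = -(p_neural * np.log(p_neural + 1e-12)).sum()
    w = H / np.log(len(p_neural))       # normalized entropy in [0, 1]
    return (1 - w) * p_neural + w * p_ngram
```

Because the cache is built from the eval stream itself during scoring, repeated patterns in the test data get sharp n-gram predictions exactly where the neural model is least confident.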
Novel Contributions
- TurboQuant random-rotation int5 quantization
- Entropy-adaptive order-9 n-gram backoff cache
- Complementary training with alpha=0.5
- LoRA test-time training with tuned temperature and Polyak averaging
- Hybrid neural + n-gram evaluation