PR #1454

open

Non-record: TurboQuant + N-gram Hybrid Eval + TTT (1xH100 NVL)

by bsisduck
val_bpb: 0.3509
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 14.92 MB

Training Techniques

Quantization
  • int5 (bits: 5, scope: all)
  • late QAT (bits: null, scope: all)
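The card does not spell out TurboQuant's random-rotation int5 scheme, so the following is only a minimal sketch of the general idea: rotate weights with a random orthogonal matrix to spread outliers, then apply symmetric 5-bit quantization. All function names are my own, not the PR's.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Sign-fix columns so the rotation is well-defined.
    return q * np.sign(np.diag(r))

def quantize_int5(w, rot):
    """Rotate, then symmetric per-tensor int5 quantization (levels -15..15)."""
    x = rot @ w
    scale = np.abs(x).max() / 15.0
    q = np.clip(np.round(x / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q, scale, rot):
    """Invert the rotation after rescaling the integer codes."""
    return rot.T @ (q.astype(np.float32) * scale)

w = np.random.default_rng(1).standard_normal((64, 64)).astype(np.float32)
rot = random_rotation(64)
q, s = quantize_int5(w, rot)
w_hat = dequantize(q, s, rot)
```

Because the rotation is orthogonal, quantization error in the rotated space maps back with the same overall magnitude, which is the usual motivation for rotating before low-bit quantization.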
Architecture
  • Gated Attention: gated attention used in all 9 layers ({"layers":9})
  • XSA: XSA attention used on the last 4 layers ({"layers":4})
  • BigramHash: bigram hash component for token statistics ({"dimensions":1536})
  • Partial RoPE: RoPE applied to a subset of dimensions ({"dimensions":16,"total_dimensions":64})
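Partial RoPE with 16 rotary dimensions out of 64 can be sketched as below: rotary embeddings are applied to the first 16 dimensions of each head and the remaining 48 pass through unchanged. The function name and the base frequency of 10000 are illustrative assumptions.

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    """Apply RoPE to the first `rotary_dims` dims; pass the rest through.
    x: (seq_len, head_dim) array."""
    seq_len, head_dim = x.shape
    half = rotary_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * inv_freq  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rotary_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Non-rotary dimensions are left untouched.
    return np.concatenate([rotated, x[:, rotary_dims:]], axis=-1)

x = np.random.default_rng(0).standard_normal((8, 64))
y = partial_rope(x, rotary_dims=16)
```

Since the rotation is norm-preserving and position 0 gets zero angles, the first row is unchanged and vector norms are preserved at every position.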
  • Value Residual: value residual connections across early layers ({"layers":"1-8"})
  • SmearGate: position-mixing gate used in the model (parameters: null)
Weight Averaging
  • EMA ({"decay":0.997})
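EMA weight averaging with the listed decay of 0.997 can be sketched as a shadow copy of the parameters updated after each optimizer step. The `EMA` class below is illustrative, not the PR's code.

```python
import numpy as np

class EMA:
    """Exponential moving average of model parameters (decay 0.997 per the card)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1.0 - self.decay) * v

params = {"w": np.zeros(3)}
ema = EMA(params, decay=0.997)
for _ in range(10):
    params["w"] += 1.0   # pretend an optimizer step moved the weights
    ema.update(params)
```

With a decay this close to 1, the averaged weights lag the raw weights substantially, which is the point: evaluation uses the smoother trajectory.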
Evaluation
  • stride-based eval ({"stride":384})
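Stride-based evaluation with stride 384 presumably slides a long context window over the sequence and scores only the newest tokens of each window, so every scored token keeps ample left context. A hedged sketch with a stand-in uniform model follows; the helper names and the 1024-token context length are assumptions.

```python
import numpy as np

def strided_eval(token_ids, score_window, context_len=1024, stride=384):
    """Sliding-window eval: each window advances by `stride` and only the
    newest `stride` positions are scored. `score_window(window)` returns
    per-token negative log2-probabilities (bits)."""
    total_bits, total_tokens = 0.0, 0
    for start in range(0, len(token_ids), stride):
        end = min(start + stride, len(token_ids))
        window = token_ids[max(0, end - context_len):end]
        nll = score_window(window)
        scored = nll[len(nll) - (end - start):]  # score only the new tokens
        total_bits += scored.sum()
        total_tokens += len(scored)
    # Bits per scored token; true bpb would divide by byte count instead.
    return total_bits / total_tokens

# Hypothetical stand-in model: uniform over 256 symbols -> 8 bits/token.
uniform = lambda w: np.full(len(w), 8.0)
tokens = np.zeros(2000, dtype=np.int64)
bpt = strided_eval(tokens, uniform, context_len=1024, stride=384)
```

A smaller stride improves per-token context at the cost of proportionally more forward passes, which is the trade-off the stride parameter tunes.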
Test-Time Training
  • LoRA TTT ({"rank":8,"learning_rate":0.005,"temperature":1.1,"polyak_decay":0.995})
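The LoRA TTT recipe (rank 8, learning rate 0.005, temperature 1.1, Polyak decay 0.995) could be read as online SGD on a low-rank adapter over a frozen base weight, with a Polyak-averaged copy of the adapter used for prediction. The self-supervised target below is a stand-in; only the hyperparameters come from the card.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, lr, polyak, temp = 32, 8, 0.005, 0.995, 1.1

W = rng.standard_normal((d, d)) / np.sqrt(d)  # frozen base weight
A = np.zeros((rank, d))                        # LoRA down-projection
B = rng.standard_normal((d, rank)) * 0.01      # LoRA up-projection
A_avg, B_avg = A.copy(), B.copy()              # Polyak-averaged adapter

def forward(x, A_, B_):
    return (W + B_ @ A_) @ x

for step in range(50):                         # online test-time steps
    x = rng.standard_normal(d)
    target = np.tanh(W @ x) * 1.5              # hypothetical TTT target
    err = forward(x, A, B) - target
    # SGD on the adapter only; the base weight W stays frozen.
    A -= lr * (B.T @ np.outer(err, x))
    B -= lr * (np.outer(err, x) @ A.T)
    # Polyak averaging of the adapter for prediction.
    A_avg = polyak * A_avg + (1 - polyak) * A
    B_avg = polyak * B_avg + (1 - polyak) * B

logits = forward(rng.standard_normal(d), A_avg, B_avg) / temp  # eval temperature
```

Initializing `A` to zero keeps the adapted model identical to the base model at step 0, the standard LoRA convention, while Polyak averaging damps the noise of per-example updates.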
Regularization
  • complementary training ({"alpha":0.5})
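The card does not define "complementary training". One focal-style reading of "emphasize harder predictions" with alpha = 0.5 is to down-weight tokens the model already predicts confidently; this is an assumption about the objective, not the PR's actual loss.

```python
import numpy as np

def complementary_loss(logits, targets, alpha=0.5):
    """Hypothetical reading of 'complementary training': weight each token's
    cross-entropy by (1 - p_target)**alpha so easy, confident predictions
    contribute less and harder tokens dominate the gradient.
    logits: (n, vocab); targets: (n,) integer class indices."""
    z = logits - logits.max(axis=-1, keepdims=True)        # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    p_t = p[np.arange(len(targets)), targets]
    ce = -np.log(p_t)
    return ((1.0 - p_t) ** alpha * ce).mean()

logits = np.array([[4.0, 0.0, 0.0],    # easy: model confident and correct
                   [0.2, 0.1, 0.0]])   # hard: nearly uniform
targets = np.array([0, 2])
loss = complementary_loss(logits, targets, alpha=0.5)
```

With this weighting the confident first token contributes almost nothing, so the average is dominated by the hard second token.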
Compression
  • lzma (level: null)
Optimizer
  • Parallel Muon (weight_decay: null, momentum: null, newton_schulz: true)

Novel Contributions

  • TurboQuant random-rotation int5 quantization
  • Entropy-adaptive order-9 n-gram backoff cache
  • Complementary training to emphasize harder predictions
  • LoRA test-time training with tuned temperature and Polyak averaging
  • Hybrid neural + n-gram evaluation
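The order-9 n-gram backoff cache from the list above can be sketched as counts over contexts of up to 8 tokens, backing off to shorter contexts at query time until one has been seen. Smoothing and the entropy-adaptive part are omitted, and the class is illustrative rather than the PR's implementation.

```python
from collections import defaultdict

class BackoffNgram:
    """Minimal order-9 n-gram backoff cache (unsmoothed, illustrative)."""
    def __init__(self, order=9, vocab=256):
        self.order, self.vocab = order, vocab
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Record each token under all of its trailing contexts (length 1..8).
        for i in range(1, len(tokens)):
            for n in range(1, min(self.order, i + 1)):
                self.counts[tuple(tokens[i - n:i])][tokens[i]] += 1

    def prob(self, context, token):
        # Back off from the longest seen context down to length 1.
        for n in range(min(self.order - 1, len(context)), 0, -1):
            dist = self.counts.get(tuple(context[-n:]))
            if dist:
                return dist.get(token, 0) / sum(dist.values())
        return 1.0 / self.vocab  # final backoff: uniform over the vocabulary

model = BackoffNgram(order=9)
model.update([1, 2, 3, 1, 2, 3, 1, 2])
p = model.prob([1, 2], 3)
```

In a hybrid evaluation, a probability like `p` would be interpolated with the neural model's token probability, with the mixing weight plausibly driven by the n-gram context's entropy, per the "entropy-adaptive" wording above.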