PR #918

Status: open

Record: TurboQuant + Full-Rescore N-gram (val_bpb=0.1653)

by haikosys
val_bpb: 0.1653
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.35 MB

Training Techniques

Architecture
BigramHash
Bigram hash embedding module used in the model.
parameters: {"dimensions":128,"hash_size":2048}
SmearGate
SmearGate component included in the architecture.
parameters: null
U-Net skip connections
U-Net style skip connections in the network.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to part of the model.
parameters: {"dimensions":16}
LeakyReLU
Squared LeakyReLU activation used in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
XSA
XSA used in the last 4 layers.
parameters: {"layers":4}
weight tying
Tied input and output embeddings.
parameters: null
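As a rough sketch of how a bigram hash embedding like the one above could work (the hashing scheme and mixing constants here are assumptions, not the PR's actual code): each (previous, current) token pair is hashed into one of hash_size=2048 buckets, each holding a learned dimensions=128 vector that is added to the token embedding.

```python
HASH_SIZE = 2048   # "hash_size" parameter above
DIM = 128          # "dimensions" parameter above

# Toy embedding table; in training this would be a learned parameter.
table = [[0.0] * DIM for _ in range(HASH_SIZE)]

def bigram_bucket(prev_token: int, cur_token: int) -> int:
    """Hash a (prev, cur) token pair into one of HASH_SIZE buckets.
    The multiplier and xor-shift are illustrative, not from the PR."""
    h = (prev_token * 1000003 + cur_token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % HASH_SIZE

def bigram_embedding(prev_token: int, cur_token: int) -> list:
    """Look up the (learned) vector for this bigram bucket."""
    return table[bigram_bucket(prev_token, cur_token)]
```

Collisions are expected and tolerated: with 2048 buckets many bigrams share a vector, which is the usual size/quality trade-off for hash embeddings.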
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"embeddings_lr":0.035,"scalars_lr":0.025}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every_n_steps":50}
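EMA weight averaging with decay=0.997 keeps a shadow copy of the weights that is updated after every optimizer step (SWA additionally averages snapshots taken every 50 steps; only the EMA update is sketched here):

```python
def ema_update(shadow, weights, decay=0.997):
    """Update the EMA shadow parameters in place:
    shadow <- decay * shadow + (1 - decay) * weights."""
    for i, w in enumerate(weights):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * w
    return shadow
```

Evaluation then uses the shadow weights, which smooths out step-to-step noise in the final checkpoint.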
Quantization
QAT
bits: 2
scope: MLP up
QAT
bits: 3
scope: attn/MLP down
QAT
bits: 4
scope: embeddings
STE QAT
bits: null
scope: all
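Quantization-aware training with a straight-through estimator (STE) typically fake-quantizes weights in the forward pass (snap to the b-bit grid, then dequantize) while the backward pass treats the operation as the identity. A minimal sketch of the fake-quant step, assuming symmetric per-tensor scaling (the PR's exact scheme is not shown):

```python
def fake_quantize(weights, bits):
    """Quantize-dequantize a list of floats onto a symmetric b-bit grid.
    During QAT the backward pass passes gradients straight through (STE)."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 3 for 3-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0                             # all-zero tensor edge case
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))            # clamp to representable range
        out.append(q * scale)
    return out
```

The per-scope bit widths above (2-bit MLP up, 3-bit attn/MLP down, 4-bit embeddings) would each get their own grid.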
Compression
lzma
level: 6
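The 15.35 MB artifact size suggests the packed quantized weights are then lzma-compressed at preset level 6; with Python's standard lzma module that step might look like:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # xz/LZMA at preset 6 (the "level: 6" above); higher presets trade
    # compression time for a smaller artifact.
    return lzma.compress(raw, preset=6)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```

Low-bit quantization helps here twice: the packed stream is smaller to begin with, and its reduced symbol alphabet compresses better.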
Evaluation
sliding window eval
parameters: {"stride":64}
full-rescore n-gram cache
parameters: {"order_min":2,"order_max":12,"entropy_adaptive_alpha":true}
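The full-rescore n-gram cache blends n-gram statistics (orders 2 through 12) into the model's next-token distribution. The exact blending rule is not given; one plausible reading of "entropy_adaptive_alpha" is to weight the n-gram cache more when its prediction is low-entropy (confident). A hypothetical sketch:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def blend(model_probs, ngram_probs, alpha_max=0.5):
    """Mix n-gram and model distributions with a weight that shrinks as
    the n-gram distribution's entropy grows. alpha_max and the linear
    entropy normalization are illustrative choices, not from the PR."""
    h = entropy(ngram_probs)
    h_max = math.log(len(ngram_probs))      # entropy of the uniform distribution
    alpha = alpha_max * (1.0 - h / h_max) if h_max > 0 else 0.0
    return [(1 - alpha) * m + alpha * n
            for m, n in zip(model_probs, ngram_probs)]
```

When the n-gram distribution is uniform (maximally uncertain), alpha collapses to 0 and the model's prediction is used unchanged.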
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
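A warmdown schedule holds the learning rate constant and then decays it over the final warmdown_steps=3500 steps. A sketch, assuming linear decay to zero (a common choice for this schedule):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```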
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • TurboQuant rotation-based Lloyd-Max codebook quantization for weight compression
  • Progressive QAT warmdown from 4-bit to 3-bit to 2-bit
  • Two-pass full-rescore n-gram cache evaluation with entropy-adaptive alpha blending
  • Combining higher-parameter TurboQuant models with full-rescore n-gram cache to recover validation performance
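TurboQuant's rotation-based preprocessing is not detailed in this summary, but the Lloyd-Max step at its core is a 1-D k-means over the weight values: alternate between assigning each weight to its nearest codeword and moving each codeword to the mean of its assigned weights. A bare-bones sketch with the rotation omitted:

```python
def lloyd_max(values, bits, iters=20):
    """Fit a 2**bits codebook minimizing squared reconstruction error."""
    k = 2 ** bits
    lo, hi = min(values), max(values)
    # Initialize codewords evenly across the value range.
    codebook = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        # Assignment step: bucket each value with its nearest codeword.
        buckets = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda i: (v - codebook[i]) ** 2)
            buckets[j].append(v)
        # Update step: move each codeword to its bucket's mean.
        codebook = [sum(b) / len(b) if b else c
                    for b, c in zip(buckets, codebook)]
    return codebook
```

Unlike the uniform grids typical of plain QAT, a Lloyd-Max codebook places its levels where the weight distribution has mass, which is why it can afford fewer bits at the same reconstruction error.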