PR #1379

open

Record: 0.4162 BPB mixed quant ngram (post-fix reruns)

by LucasErcolano
val_bpb
0.4162
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,623,718 bytes

Training Techniques

Quantization
mixed int5/int6
bits: null
scope: MLP int5; attention/embeddings int6
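The record lists the bit widths and scope (MLP at int5, attention/embeddings at int6) but not the quantization scheme itself. A minimal sketch, assuming per-tensor symmetric round-to-nearest quantization:

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Per-tensor symmetric round-to-nearest (illustrative; the record
    # does not specify the actual scheme used).
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.8, -0.3, 0.05, -1.0], dtype=np.float32)
q5, s5 = quantize_symmetric(w, 5)   # MLP weights -> int5
q6, s6 = quantize_symmetric(w, 6)   # attention/embedding weights -> int6
```

The extra bit at int6 halves the step size, so attention and embedding weights reconstruct more accurately than the int5 MLP weights.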
Architecture
GQA
Grouped query attention in the base transformer
parameters: {"heads":8,"kv_heads":4}
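With heads=8 and kv_heads=4, each key/value head is shared by two query heads. A NumPy sketch of that head-sharing pattern (shapes and details are illustrative, not the repo's code):

```python
import numpy as np

def gqa(q, k, v):
    # q: (n_heads, T, d); k, v: (n_kv_heads, T, d).
    # Each KV head serves n_heads // n_kv_heads query heads.
    n_heads, T, d = q.shape
    group = n_heads // k.shape[0]            # 2 query heads per KV head
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                      # query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores = np.where(mask, -np.inf, scores)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(0)
T, d = 5, 16
q = rng.standard_normal((8, T, d))
k = rng.standard_normal((4, T, d))
v = rng.standard_normal((4, T, d))
y = gqa(q, k, v)
```

Halving the KV heads halves the KV-cache size while keeping the full count of query heads.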
MLP3x
Expanded MLP width
parameters: {"multiplier":3}
LeakyReLU
LeakyReLU squared activation
parameters: {"squared":true,"slope":0.5}
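The MLP pairs the 3x width multiplier with a squared LeakyReLU (slope 0.5). The exact squaring convention isn't given; an elementwise square of the LeakyReLU output is one plausible reading:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU followed by an elementwise square (one reading of
    # "LeakyReLU squared"; the record gives only slope=0.5, squared=true).
    lrelu = np.where(x >= 0.0, x, slope * x)
    return lrelu ** 2

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = leaky_relu_squared(x)   # [1.0, 0.0625, 0.0, 2.25]
```

Note that squaring makes the negative branch non-negative; a sign-preserving variant (lrelu * |lrelu|) is also possible.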
SmearGate
SmearGate component in the base neural stack
parameters: null
BigramHash
Bigram hash component used in the model
parameters: {"size":2048}
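A bigram hash component of size 2048 suggests hashing (previous, current) token pairs into a fixed table, e.g. to index a learned embedding. A sketch; only the table size comes from the record, the multiplier and layout are assumptions:

```python
import numpy as np

TABLE_SIZE = 2048
D_MODEL = 64                                     # assumed embedding width
bigram_table = np.zeros((TABLE_SIZE, D_MODEL))   # learned during training

def bigram_slot(prev_tok, tok, size=TABLE_SIZE):
    # Multiplicative hash of the ordered pair into [0, size).
    # The odd constant is arbitrary, not from the record.
    return (prev_tok * 1000003 + tok) % size

def bigram_features(tokens):
    # One hashed-bigram embedding per position (position 0 has no
    # predecessor, so pair it with itself as a simple convention).
    slots = [bigram_slot(tokens[max(i - 1, 0)], t)
             for i, t in enumerate(tokens)]
    return bigram_table[slots]

feats = bigram_features([5, 17, 17, 900])
```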
VE128
Value-Residual Embeddings
parameters: {"dimensions":128}
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Compression
lzma
level: null
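The 15,623,718-byte artifact is LZMA-compressed; the level is unspecified (listed as null). Low-bit quantized weights have few distinct byte values, which LZMA exploits. A sketch with the standard-library `lzma` module, assuming preset=9:

```python
import lzma
import numpy as np

weights = np.zeros(100_000, dtype=np.int8)   # stand-in quantized weights
raw = weights.tobytes()
blob = lzma.compress(raw, preset=9)          # preset is an assumption
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```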
Evaluation
sliding window eval
parameters: {"stride":256}
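Sliding-window evaluation advances in strides of 256 so every token after the first window is scored with substantial left context, while each token is scored exactly once. A sketch assuming a 1024-token context (the record gives only stride=256):

```python
def sliding_windows(n_tokens, context=1024, stride=256):
    # Yield (start, end, score_from): the model sees tokens
    # [start, end) but only [score_from, end) are scored, so the
    # scored spans partition the sequence with no double counting.
    spans, scored, start = [], 0, 0
    while scored < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, end, scored))
        scored = end
        start += stride
    return spans

spans = sliding_windows(1100)
```

Smaller strides give each scored token more context at the cost of proportionally more forward passes.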
Test-Time Training
score-first TTT
parameters: null
Initialization
OrthoInit
Orthogonal initialization
Regularization
weight decay
parameters: null
Other
other
Complementary training that down-weights tokens easily predicted by n-grams
parameters: {"loss_reweighting":"1 - alpha * p_bigram(token)"}
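The listed formula weights each token's loss by 1 - alpha * p_bigram(token), so the neural model's capacity concentrates on tokens the n-gram cannot predict. A sketch; alpha=0.5 is an assumed value, not from the record:

```python
import numpy as np

def complementary_loss(nll, p_bigram, alpha=0.5):
    # Per-token weight from the record's formula:
    # weight = 1 - alpha * p_bigram(token). Bigram-easy tokens
    # (high p_bigram) contribute less to the training gradient.
    w = 1.0 - alpha * p_bigram
    return float(np.mean(w * nll))

nll = np.array([2.0, 2.0])          # equal per-token losses
p_bigram = np.array([0.9, 0.1])     # first token is bigram-easy
loss = complementary_loss(nll, p_bigram)
```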
other
Causal backoff n-gram mixer with entropy-adaptive blending
parameters: null
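Entropy-adaptive blending mixes the n-gram and neural next-token distributions, trusting the n-gram more when it is confident (low entropy). The record gives no parameters, so the functional form and the 0.3 cap below are assumptions, and the backoff across n-gram orders is omitted:

```python
import numpy as np

def entropy_adaptive_blend(p_neural, p_ngram, max_weight=0.3):
    # Mixing weight decays with the n-gram entropy: a peaked
    # (confident) n-gram gets up to max_weight, a flat one nearly 0.
    H = -np.sum(p_ngram * np.log(np.clip(p_ngram, 1e-12, 1.0)))
    lam = max_weight * np.exp(-H)
    p = (1.0 - lam) * p_neural + lam * p_ngram
    return p / p.sum()

p_neural = np.full(4, 0.25)
peaked = np.array([1.0, 0.0, 0.0, 0.0])
flat = np.full(4, 0.25)
p1 = entropy_adaptive_blend(p_neural, peaked)
p2 = entropy_adaptive_blend(p_neural, flat)
```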
other
DDP-safe score-first update protocol with synchronization before cache update
parameters: null

Novel Contributions

  • Post-hash-fix rerun of the mixed quant n-gram record
  • Mixed precision quantization with int5 MLP weights and int6 attention/embedding weights
  • Complementary training to focus the neural model on tokens poorly predicted by n-grams
  • Causal backoff n-gram mixer with entropy-adaptive blending
  • DDP-safe score-first update protocol for multi-GPU evaluation
  • Aligned higher-order n-gram hash ordering between update and score