PR #864 (closed)
Record: 11L Parallel Muon + N-gram Backoff Cache — val_bpb 0.2841 (3-seed mean)
by aryanbhosale
val_bpb: 0.2841
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.85 MB
Training Techniques
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true,"batched_ns5":true}
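The "batched_ns5" flag refers to the five-step quintic Newton-Schulz iteration Muon uses to approximately orthogonalize gradient updates, run batched across banked parameters. A minimal single-matrix sketch, using the coefficients from the public Muon reference (the PR's batching and banking details are not shown here):

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration used by Muon (coefficients from the public Muon reference).
    Singular values are driven toward 1 rather than computed exactly."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius normalization bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # keep the Gram matrix X @ X.T small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

In practice Muon applies this to each 2D weight's momentum buffer; "parameter banking" groups same-shaped matrices so the iteration runs as one batched matmul.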
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
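With 8 query heads over 4 KV heads, each KV head is shared by 2 query heads. A minimal numpy sketch of that sharing (non-causal, single sequence, illustrative shapes):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves
    n_heads // n_kv_heads query heads.
    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                              # query head h reads KV head h // group
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                # softmax over keys
        out[h] = w @ v[kv]
    return out
```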
LeakyReLU
MLP uses a 3x width multiplier with squared LeakyReLU activation (negative slope 0.5).
parameters: {"multiplier":3,"slope":0.5}
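The activation itself is one line; note that squaring makes the output non-negative everywhere, unlike plain LeakyReLU:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: LeakyReLU(x, slope) ** 2.
    Squaring keeps the output non-negative for all inputs."""
    return np.where(x >= 0, x, slope * x) ** 2
```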
SmearGate
Custom gating component used in the model.
parameters: null
BigramHash
Bigram hash component with 1024 buckets.
parameters: {"buckets":1024}
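A bigram-hash component typically maps each (previous, current) token pair into a fixed bucket table, e.g. for an auxiliary embedding lookup. A sketch with the stated 1024 buckets; the mixing constants are illustrative, not the PR's exact hash:

```python
def bigram_bucket(prev_tok, tok, n_buckets=1024):
    """Hash a (prev, current) token pair into one of n_buckets.
    Constants are illustrative, not the PR's exact hash function."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 2654435769) & 0xFFFFFFFF  # Knuth-style multiplicative mix
    return h % n_buckets
```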
Value Residual
Adds value residual connections.
parameters: null
Gated Attention
Attention mechanism includes gating.
parameters: null
XSA4
XSA4 architectural component.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to 16 of 64 dimensions.
parameters: {"dimensions":"16/64"}
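Partial RoPE rotates only a slice of each head's dimensions and passes the rest through unchanged. A sketch applying the usual RoPE recipe to the first 16 of 64 dimensions (the pairing convention and base frequency 10000 are standard assumptions, not confirmed by the PR):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16):
    """Rotate the first rot_dims of the head dimension (16 of 64 here);
    the remaining dims pass through unchanged. x: (..., head_dim)."""
    d = rot_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(d) / d))
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., :d], x[..., d:rot_dims]      # first-half / second-half pairing
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

At position 0 the rotation is the identity, so the function leaves the input unchanged there.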
U-Net skip connections
U-Net style skip connections are used.
parameters: null
OrthoInit
Orthogonal initialization.
parameters: null
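Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix; a sketch of that standard recipe (the PR's exact gain/scaling is not specified):

```python
import numpy as np

def orthogonal_init(rows, cols, seed=0):
    """Orthogonal init via QR of a Gaussian matrix (standard recipe)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # fix QR sign ambiguity for a uniform distribution
    return q[:rows, :cols] if rows >= cols else q.T[:rows, :cols]
```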
Weight Averaging
EMA + SWA (exponential moving average of weights combined with stochastic weight averaging).
parameters: {"ema_decay":0.997}
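The EMA half is a per-step exponential moving average of the weights with the stated decay 0.997; SWA is a plain running average over checkpoints. A sketch of the EMA step (the dict-of-arrays interface is an assumption):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a dict of parameter arrays:
    ema <- decay * ema + (1 - decay) * current."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
```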
Quantization
late QAT (quantization-aware training enabled late in the run)
bits: 6
scope: model
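A generic sketch of 6-bit fake quantization as used in QAT forward passes: round weights to one of 2^6 levels, then dequantize. Per-tensor symmetric scaling is an assumption here; the PR's exact scheme (per-tensor vs per-channel, rounding mode) is not stated:

```python
import numpy as np

def fake_quant6(w):
    """Symmetric 6-bit fake quantization: quantize to the signed
    6-bit grid, then dequantize. Per-tensor scale (an assumption)."""
    qmax = 2 ** 5 - 1                          # symmetric range uses +/-31
    scale = np.abs(w).max() / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```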
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
Other
Eval-time backward-looking N-gram backoff cache with entropy-adaptive alpha blending and chunked score-then-update processing.
parameters: {"orders":"2-9","chunk_size":65000,"hash_buckets":4000000}
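The cache described above can be sketched as follows, using the PR's stated orders (2-9), hash-bucket count, and score-then-update discipline (score a chunk with the frozen cache, then ingest it). The hash function, backoff rule, and sigmoid alpha schedule here are illustrative stand-ins, not the PR's exact formulation:

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Eval-time backward-looking N-gram cache (sketch): hashed context
    counts for orders 2-9 over already-scored tokens."""

    def __init__(self, orders=range(2, 10), n_buckets=4_000_000):
        self.orders = list(orders)
        self.n_buckets = n_buckets
        self.ctx_counts = defaultdict(int)    # context bucket -> occurrences
        self.pair_counts = defaultdict(int)   # (context bucket, next token) -> count

    def _bucket(self, ctx):
        return hash(ctx) % self.n_buckets

    def update(self, tokens):
        """Ingest a chunk of already-scored tokens (score-then-update)."""
        for i in range(1, len(tokens)):
            for n in self.orders:
                if i - (n - 1) < 0:
                    continue
                b = self._bucket(tuple(tokens[i - (n - 1):i]))
                self.ctx_counts[b] += 1
                self.pair_counts[(b, tokens[i])] += 1

    def prob(self, context, token):
        """Back off from the longest matching context to shorter ones."""
        for n in sorted(self.orders, reverse=True):
            if len(context) < n - 1:
                continue
            b = self._bucket(tuple(context[-(n - 1):]))
            if self.ctx_counts[b] > 0:
                return self.pair_counts[(b, token)] / self.ctx_counts[b]
        return None                           # no match at any order

def blend(p_model, p_ngram, entropy, max_alpha=0.5, h0=2.0):
    """Entropy-adaptive blending: weight the cache more when the model
    is uncertain (high entropy). The sigmoid schedule is illustrative."""
    if p_ngram is None:
        return p_model
    alpha = max_alpha / (1.0 + math.exp(-(entropy - h0)))
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

In the PR's setup the score/update cycle runs every 65K tokens, so the cache only ever reflects text that has already been scored, keeping the evaluation honest.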
Novel Contributions
- Eval-time backward-looking N-gram backoff cache
- Entropy-adaptive alpha blending between model and N-gram probabilities
- Chunked score-then-update cache refresh every 65K tokens
- Multi-order backoff with per-order weighting across orders 2-9
- Parallel Muon with parameter banking and batched Newton-Schulz
- Compact 11-layer Transformer with multiple custom architectural components