PR #865
Record (open): 11L Parallel Muon + N-gram Backoff Cache — val_bpb 0.2841 (3-seed mean)
by aryanbhosale
val_bpb
0.2841
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.85 MB
Training Techniques
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true,"batched_ns5":true}
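"Batched NS5" refers to Muon's quintic Newton-Schulz orthogonalization step. A minimal single-matrix NumPy sketch of that step (a stand-in for the PR's batched GPU version; the coefficients are the standard Muon constants):

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Quintic Newton-Schulz iteration used by Muon to approximately
    orthogonalize a gradient matrix (standard Muon coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the short side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))
O = newton_schulz5(G)                    # rows of O are near-orthonormal
```

After a few iterations the singular values of the output cluster around 1, which is what lets Muon update all directions of the gradient at a similar scale.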
Architecture
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
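A shape-level sketch of the 8-head / 4-KV-head configuration (NumPy, single sequence, no causal mask; a minimal illustration of KV-head sharing, not the PR's kernel):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads. Shapes: q (T, 8, d); k, v (T, 4, d)."""
    group = q.shape[1] // k.shape[1]          # 8 // 4 = 2 query heads per KV head
    k = np.repeat(k, group, axis=1)           # broadcast 4 KV heads up to 8
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # softmax over source positions
    return np.einsum('hts,shd->thd', w, v)

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa_attention(rng.standard_normal((T, 8, d)),
                    rng.standard_normal((T, 4, d)),
                    rng.standard_normal((T, 4, d)))
```

Halving the KV heads halves the KV cache while keeping the full set of query heads, which is the usual motivation for GQA.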
LeakyReLU
MLP with a 3x width multiplier and squared LeakyReLU activation (negative slope 0.5).
parameters: {"multiplier":3,"slope":0.5}
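The activation and width multiplier above can be sketched directly (a minimal version; weight shapes and the absence of biases are assumptions):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: LeakyReLU(x, 0.5) ** 2, a ReLU^2-style activation."""
    return np.where(x > 0, x, slope * x) ** 2

def mlp(x, w_in, w_out):
    """MLP with 3x width multiplier: w_in is (d, 3d), w_out is (3d, d)."""
    return leaky_relu_sq(x @ w_in) @ w_out
```

Note that squaring makes the negative branch positive as well: an input of -2 maps to (0.5 * -2)^2 = 1.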
SmearGate
SmearGate component is included in the architecture.
parameters: null
BigramHash
BigramHash embedding/component with size 1024.
parameters: {"dimensions":1024}
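The record doesn't spell out the BigramHash scheme; a plausible minimal version hashes each (previous, current) token pair into a 1024-entry auxiliary embedding table (the hash constant and the pad-with-0 choice are assumptions):

```python
import numpy as np

def bigram_bucket_ids(tokens, n_buckets=1024, mult=1000003):
    """Hash each (prev, cur) token bigram into one of n_buckets; the id
    indexes an auxiliary embedding table."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))     # assumed pad id 0 at position 0
    return (prev * mult + tokens) % n_buckets

def bigram_embed(tokens, table):
    """Look up hashed-bigram embeddings, to be added to the token embeddings."""
    return table[bigram_bucket_ids(tokens, table.shape[0])]
```

A table of only 1024 buckets accepts hash collisions in exchange for a tiny parameter footprint, which fits the ~15.85 MB artifact budget.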
Value Residual
Value residual pathway is used.
parameters: null
Gated Attention
Attention mechanism includes gating.
parameters: null
XSA
XSA component, variant 4, is included.
parameters: {"variant":4}
Partial RoPE
Partial rotary positional embeddings applied to 16/64 dimensions.
parameters: {"dimensions":"16/64"}
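A sketch of rotating only the first 16 of 64 head dimensions (half-split rotation layout and base 10000 are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of the head
    dims, leaving the remaining dims untouched. x: (T, 64)."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)            # (T, half)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                              x1 * np.sin(ang) + x2 * np.cos(ang)], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

The untouched 48 dims carry position-independent content; the rotation is norm-preserving on the first 16 dims and is the identity at position 0.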
U-Net skip connections
U-Net style skip connections are used.
parameters: null
OrthoInit
Orthogonal initialization is used.
parameters: null
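The standard recipe for orthogonal initialization is QR decomposition of a Gaussian matrix; a minimal sketch (gain of 1 and the rows >= cols assumption are choices made here):

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, rng=None):
    """Orthogonal init via QR of a Gaussian matrix (assumes rows >= cols);
    the sign fix makes the sampled Q uniformly distributed."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))     # column-wise sign correction
    return gain * q
```

The resulting matrix has exactly orthonormal columns, so it neither amplifies nor attenuates activations at initialization.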
Weight Averaging
EMA + SWA
parameters: {"decay":0.997}
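Only the EMA decay (0.997) is given; how EMA and SWA are combined is not specified. A minimal tracker that maintains both averages side by side:

```python
import numpy as np

class WeightAverager:
    """EMA with decay 0.997 (as listed) plus a plain SWA running mean.
    The combination rule used by the PR is not specified; this just tracks both."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.ema = {k: v.copy() for k, v in params.items()}
        self.swa = {k: v.copy() for k, v in params.items()}
        self.n = 1

    def update(self, params):
        self.n += 1
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
            self.swa[k] += (v - self.swa[k]) / self.n    # incremental mean
```

With decay 0.997 the EMA has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.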
Quantization
GPTQ-lite
bits: 6
scope: model
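The 6-bit storage format can be illustrated with per-row symmetric quantization; GPTQ proper adds a Hessian-weighted error-compensation pass, which is omitted in this sketch (per-row scaling is also an assumption):

```python
import numpy as np

def quantize_6bit(W):
    """Per-row symmetric 6-bit round-to-nearest quantization (levels -31..31).
    A storage-format sketch only; GPTQ's error compensation is omitted."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0 + 1e-12
    q = np.clip(np.round(W / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_6bit(q, scale):
    return q.astype(np.float32) * scale
```

At 6 bits per weight plus per-row scales, a model of this size lands in the ~16 MB range before entropy coding, consistent with the listed artifact size after zstd.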
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
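Only the stride (64) is given; assuming a context window of 512, sliding-window evaluation scores each token once with long left context, as in this span-planning sketch:

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    """Plan (begin, end, n_scored) spans for sliding-window scoring: each
    window scores only the tokens not covered by the previous window, so
    every token is scored exactly once with up to window-1 tokens of left
    context. (window=512 is an assumption; only stride=64 is in the record.)"""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))   # score the last (end - prev_end) tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A small stride trades more forward passes for more context per scored token, which typically lowers measured bpb relative to non-overlapping chunks.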
Other
other
Eval-time backward-looking N-gram backoff cache with entropy-adaptive alpha blending and chunked score-then-update processing.
parameters: {"order_range":"2-9","chunk_size_tokens":65000,"hash_buckets":4000000,"backward_looking":true,"score_first":true}
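The mechanics above can be sketched end to end. Only the order range (2-9), bucket count (4M), chunk size, and score-then-update ordering come from the record; the hash, the linear-in-order backoff weights, and the alpha schedule below are illustrative assumptions:

```python
from collections import defaultdict

class NGramBackoffCache:
    """Eval-time backward-looking N-gram backoff cache (sketch): contexts of
    orders 2-9 are hashed into a fixed number of buckets, each bucket holding
    next-token counts accumulated from already-scored text."""
    def __init__(self, orders=range(2, 10), n_buckets=4_000_000):
        self.orders = list(orders)
        self.n_buckets = n_buckets
        self.counts = defaultdict(lambda: defaultdict(int))

    def _bucket(self, order, ctx):
        return hash((order,) + ctx) % self.n_buckets

    def update(self, tokens):
        """Score-then-update: call on a chunk (65K tokens in the record)
        only after that chunk has been scored."""
        for i in range(len(tokens)):
            for n in self.orders:
                if i >= n - 1:
                    ctx = tuple(tokens[i - n + 1:i])
                    self.counts[self._bucket(n, ctx)][tokens[i]] += 1

    def prob(self, context, token):
        """Multi-order backoff: blend per-order estimates, weighting higher
        orders more (linear-in-order weights are an assumption); returns
        None when no order has seen the context."""
        num = den = 0.0
        for n in self.orders:
            if len(context) < n - 1:
                continue
            c = self.counts.get(self._bucket(n, tuple(context[-(n - 1):])))
            if c:
                total = sum(c.values())
                num += n * c.get(token, 0) / total
                den += n
        return num / den if den else None

def blend(p_model, p_cache, cache_entropy, alpha_max=0.3):
    """Entropy-adaptive alpha: trust the cache less as its next-token
    distribution gets flatter (schedule and alpha_max are assumptions)."""
    alpha = alpha_max / (1.0 + cache_entropy)
    return (1.0 - alpha) * p_model + alpha * p_cache
```

Because scoring always precedes the cache update for a chunk, the cache only ever conditions on text strictly before the tokens being evaluated, keeping the procedure backward-looking.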
Novel Contributions
- Eval-time backward-looking N-gram backoff cache
- Entropy-adaptive alpha blending between model and N-gram probabilities
- Chunked score-then-update cache refresh every 65K tokens
- Multi-order backoff with per-order weighting across orders 2-9
- Parallel Muon with parameter banking and batched Newton-Schulz
- Combined architecture stack with SmearGate, BigramHash, GQA, Value Residual, and gated attention