PR #865
Record (open): 11L Parallel Muon + N-gram Backoff Cache — val_bpb 0.2841 (3-seed mean)
by aryanbhosale
val_bpb
0.2841
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.85 MB
Training Techniques
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true,"batched_ns5":true}
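"Batched NS5" refers to Muon's quintic Newton-Schulz orthogonalization step. A minimal single-matrix NumPy sketch of that step (a stand-in for the PR's batched GPU version; the coefficients are the standard Muon constants):

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Quintic Newton-Schulz iteration used by Muon to approximately
    orthogonalize a gradient matrix (standard Muon coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the short side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))
O = newton_schulz5(G)                    # rows of O are near-orthonormal
```

After a few iterations the singular values of the output cluster around 1, which is what lets Muon update all directions of the gradient at a similar scale.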
Architecture
GQA
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
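A shape-level sketch of the 8-head / 4-KV-head configuration (NumPy, single sequence, no causal mask; a minimal illustration of KV-head sharing, not the PR's kernel):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads. Shapes: q (T, 8, d); k, v (T, 4, d)."""
    group = q.shape[1] // k.shape[1]          # 8 // 4 = 2 query heads per KV head
    k = np.repeat(k, group, axis=1)           # broadcast 4 KV heads up to 8
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # softmax over source positions
    return np.einsum('hts,shd->thd', w, v)

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa_attention(rng.standard_normal((T, 8, d)),
                    rng.standard_normal((T, 4, d)),
                    rng.standard_normal((T, 4, d)))
```

Halving the KV heads halves the KV cache while keeping the full set of query heads, which is the usual motivation for GQA.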
LeakyReLU
MLP with a 3x width multiplier and squared LeakyReLU activation (negative slope 0.5).
parameters: {"multiplier":3,"slope":0.5}
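The activation and width multiplier above can be sketched directly (a minimal version; weight shapes and the absence of biases are assumptions):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: LeakyReLU(x, 0.5) ** 2, a ReLU^2-style activation."""
    return np.where(x > 0, x, slope * x) ** 2

def mlp(x, w_in, w_out):
    """MLP with 3x width multiplier: w_in is (d, 3d), w_out is (3d, d)."""
    return leaky_relu_sq(x @ w_in) @ w_out
```

Note that squaring makes the negative branch positive as well: an input of -2 maps to (0.5 * -2)^2 = 1.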
SmearGate
SmearGate component is included in the architecture.
parameters: null
BigramHash
BigramHash embedding/component with size 1024.
parameters: {"dimensions":1024}
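The record doesn't spell out the BigramHash scheme; a plausible minimal version hashes each (previous, current) token pair into a 1024-entry auxiliary embedding table (the hash constant and the pad-with-0 choice are assumptions):

```python
import numpy as np

def bigram_bucket_ids(tokens, n_buckets=1024, mult=1000003):
    """Hash each (prev, cur) token bigram into one of n_buckets; the id
    indexes an auxiliary embedding table."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))     # assumed pad id 0 at position 0
    return (prev * mult + tokens) % n_buckets

def bigram_embed(tokens, table):
    """Look up hashed-bigram embeddings, to be added to the token embeddings."""
    return table[bigram_bucket_ids(tokens, table.shape[0])]
```

A table of only 1024 buckets accepts hash collisions in exchange for a tiny parameter footprint, which fits the ~15.85 MB artifact budget.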
Value Residual
Value residual pathway is used.
parameters: null
Gated Attention
Attention mechanism includes gating.
parameters: null
XSA
XSA component, variant 4, is included.
parameters: {"variant":4}
Partial RoPE
Partial rotary positional embeddings applied to 16/64 dimensions.
parameters: {"dimensions":"16/64"}
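A sketch of rotating only the first 16 of 64 head dimensions (half-split rotation layout and base 10000 are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` of the head
    dims, leaving the remaining dims untouched. x: (T, 64)."""
    T, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(T), inv_freq)            # (T, half)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                              x1 * np.sin(ang) + x2 * np.cos(ang)], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```

The untouched 48 dims carry position-independent content; the rotation is norm-preserving on the first 16 dims and is the identity at position 0.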
U-Net skip connections
U-Net style skip connections are used.
parameters: null
OrthoInit
Orthogonal initialization is used.
parameters: null
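The standard recipe for orthogonal initialization is QR decomposition of a Gaussian matrix; a minimal sketch (gain of 1 and the rows >= cols assumption are choices made here):

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, rng=None):
    """Orthogonal init via QR of a Gaussian matrix (assumes rows >= cols);
    the sign fix makes the sampled Q uniformly distributed."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))     # column-wise sign correction
    return gain * q
```

The resulting matrix has exactly orthonormal columns, so it neither amplifies nor attenuates activations at initialization.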
Weight Averaging
EMA + SWA
parameters: {"decay":0.997}
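Only the EMA decay (0.997) is given; how EMA and SWA are combined is not specified. A minimal tracker that maintains both averages side by side:

```python
import numpy as np

class WeightAverager:
    """EMA with decay 0.997 (as listed) plus a plain SWA running mean.
    The combination rule used by the PR is not specified; this just tracks both."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.ema = {k: v.copy() for k, v in params.items()}
        self.swa = {k: v.copy() for k, v in params.items()}
        self.n = 1

    def update(self, params):
        self.n += 1
        for k, v in params.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
            self.swa[k] += (v - self.swa[k]) / self.n    # incremental mean
```

With decay 0.997 the EMA has an effective horizon of roughly 1 / (1 - 0.997) ≈ 333 steps.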
Quantization
GPTQ-lite
bits: 6
scope: model
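The 6-bit storage format can be illustrated with per-row symmetric quantization; GPTQ proper adds a Hessian-weighted error-compensation pass, which is omitted in this sketch (per-row scaling is also an assumption):

```python
import numpy as np

def quantize_6bit(W):
    """Per-row symmetric 6-bit round-to-nearest quantization (levels -31..31).
    A storage-format sketch only; GPTQ's error compensation is omitted."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0 + 1e-12
    q = np.clip(np.round(W / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_6bit(q, scale):
    return q.astype(np.float32) * scale
```

At 6 bits per weight plus per-row scales, a model of this size lands in the ~16 MB range before entropy coding, consistent with the listed artifact size after zstd.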
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
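Only the stride (64) is given; assuming a context window of 512, sliding-window evaluation scores each token once with long left context, as in this span-planning sketch:

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    """Plan (begin, end, n_scored) spans for sliding-window scoring: each
    window scores only the tokens not covered by the previous window, so
    every token is scored exactly once with up to window-1 tokens of left
    context. (window=512 is an assumption; only stride=64 is in the record.)"""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))   # score the last (end - prev_end) tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

A small stride trades more forward passes for more context per scored token, which typically lowers measured bpb relative to non-overlapping chunks.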
Other
other
Eval-time backward-looking N-gram backoff cache with entropy-adaptive alpha blending and chunked score-then-update processing.
parameters: {"order_range":"2-9","chunk_size_tokens":65000,"hash_buckets":4000000,"backward_looking":true,"score_first":true}
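The mechanics above can be sketched end to end. Only the order range (2-9), bucket count (4M), chunk size, and score-then-update ordering come from the record; the hash, the linear-in-order backoff weights, and the alpha schedule below are illustrative assumptions:

```python
from collections import defaultdict

class NGramBackoffCache:
    """Eval-time backward-looking N-gram backoff cache (sketch): contexts of
    orders 2-9 are hashed into a fixed number of buckets, each bucket holding
    next-token counts accumulated from already-scored text."""
    def __init__(self, orders=range(2, 10), n_buckets=4_000_000):
        self.orders = list(orders)
        self.n_buckets = n_buckets
        self.counts = defaultdict(lambda: defaultdict(int))

    def _bucket(self, order, ctx):
        return hash((order,) + ctx) % self.n_buckets

    def update(self, tokens):
        """Score-then-update: call on a chunk (65K tokens in the record)
        only after that chunk has been scored."""
        for i in range(len(tokens)):
            for n in self.orders:
                if i >= n - 1:
                    ctx = tuple(tokens[i - n + 1:i])
                    self.counts[self._bucket(n, ctx)][tokens[i]] += 1

    def prob(self, context, token):
        """Multi-order backoff: blend per-order estimates, weighting higher
        orders more (linear-in-order weights are an assumption); returns
        None when no order has seen the context."""
        num = den = 0.0
        for n in self.orders:
            if len(context) < n - 1:
                continue
            c = self.counts.get(self._bucket(n, tuple(context[-(n - 1):])))
            if c:
                total = sum(c.values())
                num += n * c.get(token, 0) / total
                den += n
        return num / den if den else None

def blend(p_model, p_cache, cache_entropy, alpha_max=0.3):
    """Entropy-adaptive alpha: trust the cache less as its next-token
    distribution gets flatter (schedule and alpha_max are assumptions)."""
    alpha = alpha_max / (1.0 + cache_entropy)
    return (1.0 - alpha) * p_model + alpha * p_cache
```

Because scoring always precedes the cache update for a chunk, the cache only ever conditions on text strictly before the tokens being evaluated, keeping the procedure backward-looking.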
Novel Contributions
- Eval-time backward-looking N-gram backoff cache
- Entropy-adaptive alpha blending between model and N-gram probabilities
- Chunked score-then-update cache refresh every 65K tokens
- Multi-order backoff with per-order weighting across orders 2-9
- Parallel Muon with parameter banking and batched Newton-Schulz
- Combined architecture stack with SmearGate, BigramHash, GQA, Value Residual, and gated attention