PR #864 (closed)
Record: 11L Parallel Muon + N-gram Backoff Cache — val_bpb 0.2841 (3-seed mean)
by aryanbhosale
val_bpb: 0.2841
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.85 MB
Training Techniques
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true,"batched_ns5":true}
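The "batched_ns5" flag refers to the five-step quintic Newton-Schulz iteration Muon uses to approximately orthogonalize gradient updates, run batched across banked parameters. A minimal single-matrix sketch, using the coefficients from the public Muon reference (the PR's batching and banking details are not shown here):

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration used by Muon (coefficients from the public Muon reference).
    Singular values are driven toward 1 rather than computed exactly."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius normalization bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # keep the Gram matrix X @ X.T small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

In practice Muon applies this to each 2D weight's momentum buffer; "parameter banking" groups same-shaped matrices so the iteration runs as one batched matmul.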
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
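With 8 query heads over 4 KV heads, each KV head is shared by 2 query heads. A minimal numpy sketch of that sharing (non-causal, single sequence, illustrative shapes):

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves
    n_heads // n_kv_heads query heads.
    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                              # query head h reads KV head h // group
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                # softmax over keys
        out[h] = w @ v[kv]
    return out
```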
LeakyReLU
MLP uses a 3x width multiplier with squared LeakyReLU activation (negative slope 0.5).
parameters: {"multiplier":3,"slope":0.5}
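The activation itself is one line; note that squaring makes the output non-negative everywhere, unlike plain LeakyReLU:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU: LeakyReLU(x, slope) ** 2.
    Squaring keeps the output non-negative for all inputs."""
    return np.where(x >= 0, x, slope * x) ** 2
```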
SmearGate
Custom gating component used in the model.
parameters: null
BigramHash
Bigram hash component with 1024 buckets.
parameters: {"buckets":1024}
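A bigram-hash component typically maps each (previous, current) token pair into a fixed bucket table, e.g. for an auxiliary embedding lookup. A sketch with the stated 1024 buckets; the mixing constants are illustrative, not the PR's exact hash:

```python
def bigram_bucket(prev_tok, tok, n_buckets=1024):
    """Hash a (prev, current) token pair into one of n_buckets.
    Constants are illustrative, not the PR's exact hash function."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 2654435769) & 0xFFFFFFFF  # Knuth-style multiplicative mix
    return h % n_buckets
```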
Value Residual
Adds value residual connections.
parameters: null
Gated Attention
Attention mechanism includes gating.
parameters: null
XSA4
XSA4 architectural component.
parameters: null
Partial RoPE
Partial rotary positional embeddings applied to 16 of 64 dimensions.
parameters: {"dimensions":"16/64"}
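Partial RoPE rotates only a slice of each head's dimensions and passes the rest through unchanged. A sketch applying the usual RoPE recipe to the first 16 of 64 dimensions (the pairing convention and base frequency 10000 are standard assumptions, not confirmed by the PR):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16):
    """Rotate the first rot_dims of the head dimension (16 of 64 here);
    the remaining dims pass through unchanged. x: (..., head_dim)."""
    d = rot_dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(d) / d))
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., :d], x[..., d:rot_dims]      # first-half / second-half pairing
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```

At position 0 the rotation is the identity, so the function leaves the input unchanged there.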
U-Net skip connections
U-Net style skip connections are used.
parameters: null
OrthoInit
Orthogonal initialization.
parameters: null
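Orthogonal initialization is conventionally done via QR decomposition of a Gaussian matrix; a sketch of that standard recipe (the PR's exact gain/scaling is not specified):

```python
import numpy as np

def orthogonal_init(rows, cols, seed=0):
    """Orthogonal init via QR of a Gaussian matrix (standard recipe)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # fix QR sign ambiguity for a uniform distribution
    return q[:rows, :cols] if rows >= cols else q.T[:rows, :cols]
```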
Weight Averaging
EMA + SWA (exponential moving average of weights combined with stochastic weight averaging).
parameters: {"ema_decay":0.997}
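The EMA half is a per-step exponential moving average of the weights with the stated decay 0.997; SWA is a plain running average over checkpoints. A sketch of the EMA step (the dict-of-arrays interface is an assumption):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a dict of parameter arrays:
    ema <- decay * ema + (1 - decay) * current."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
```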
Quantization
late QAT (quantization-aware training enabled late in the run)
bits: 6
scope: model
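A generic sketch of 6-bit fake quantization as used in QAT forward passes: round weights to one of 2^6 levels, then dequantize. Per-tensor symmetric scaling is an assumption here; the PR's exact scheme (per-tensor vs per-channel, rounding mode) is not stated:

```python
import numpy as np

def fake_quant6(w):
    """Symmetric 6-bit fake quantization: quantize to the signed
    6-bit grid, then dequantize. Per-tensor scale (an assumption)."""
    qmax = 2 ** 5 - 1                          # symmetric range uses +/-31
    scale = np.abs(w).max() / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```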
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
Other
Eval-time backward-looking N-gram backoff cache with entropy-adaptive alpha blending and chunked score-then-update processing.
parameters: {"orders":"2-9","chunk_size":65000,"hash_buckets":4000000}
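The cache described above can be sketched as follows, using the PR's stated orders (2-9), hash-bucket count, and score-then-update discipline (score a chunk with the frozen cache, then ingest it). The hash function, backoff rule, and sigmoid alpha schedule here are illustrative stand-ins, not the PR's exact formulation:

```python
import math
from collections import defaultdict

class NgramBackoffCache:
    """Eval-time backward-looking N-gram cache (sketch): hashed context
    counts for orders 2-9 over already-scored tokens."""

    def __init__(self, orders=range(2, 10), n_buckets=4_000_000):
        self.orders = list(orders)
        self.n_buckets = n_buckets
        self.ctx_counts = defaultdict(int)    # context bucket -> occurrences
        self.pair_counts = defaultdict(int)   # (context bucket, next token) -> count

    def _bucket(self, ctx):
        return hash(ctx) % self.n_buckets

    def update(self, tokens):
        """Ingest a chunk of already-scored tokens (score-then-update)."""
        for i in range(1, len(tokens)):
            for n in self.orders:
                if i - (n - 1) < 0:
                    continue
                b = self._bucket(tuple(tokens[i - (n - 1):i]))
                self.ctx_counts[b] += 1
                self.pair_counts[(b, tokens[i])] += 1

    def prob(self, context, token):
        """Back off from the longest matching context to shorter ones."""
        for n in sorted(self.orders, reverse=True):
            if len(context) < n - 1:
                continue
            b = self._bucket(tuple(context[-(n - 1):]))
            if self.ctx_counts[b] > 0:
                return self.pair_counts[(b, token)] / self.ctx_counts[b]
        return None                           # no match at any order

def blend(p_model, p_ngram, entropy, max_alpha=0.5, h0=2.0):
    """Entropy-adaptive blending: weight the cache more when the model
    is uncertain (high entropy). The sigmoid schedule is illustrative."""
    if p_ngram is None:
        return p_model
    alpha = max_alpha / (1.0 + math.exp(-(entropy - h0)))
    return (1.0 - alpha) * p_model + alpha * p_ngram
```

In the PR's setup the score/update cycle runs every 65K tokens, so the cache only ever reflects text that has already been scored, keeping the evaluation honest.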
Novel Contributions
- Eval-time backward-looking N-gram backoff cache
- Entropy-adaptive alpha blending between model and N-gram probabilities
- Chunked score-then-update cache refresh every 65K tokens
- Multi-order backoff with per-order weighting across orders 2-9
- Parallel Muon with parameter banking and batched Newton-Schulz
- Compact 11-layer Transformer with multiple custom architectural components