PR #893
openRecord: Two-Pass Order-12 N-gram Backoff + Parallel Muon — val_bpb 0.1310 (3-seed)
by aryanbhosale
val_bpb
0.1310
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.85 MB
Training Techniques
Evaluation
sliding window eval
parameters: {"stride":64}
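One way to lay out the stride-64 sliding-window evaluation is sketched below; `sliding_windows` and its signature are illustrative, not the PR's code. Each window scores only its trailing `stride` tokens, so every token after the first window is predicted with (near-)full left context:

```python
# Hypothetical sketch of sliding-window eval spans. Each triple is
# (window_start, window_end, score_from): the model sees tokens
# [window_start, window_end) but only tokens [score_from, window_end)
# contribute to the bpb total, so no token is scored twice.

def sliding_windows(n_tokens, window=65536, stride=64):
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)  # keep window size fixed
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans

# small numbers for illustration; the record uses window=65536, stride=64
spans = sliding_windows(200, window=128, stride=64)
```

Summing `end - score_from` over all spans recovers exactly `n_tokens`, i.e. full coverage with no overlap in scoring.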
Test-Time Training
score-first TTT
parameters: {"passes":2,"cache_orders":"2-12","cold_cache_chunks":50}
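A minimal sketch of the score-first discipline behind this TTT setup, assuming a hash-map backoff cache over orders 2-12 (class and method names are illustrative): tokens are scored against the cache *before* they are inserted, so the cache is strictly backward-looking and no token conditions on itself.

```python
from collections import defaultdict

class NGramCache:
    """Backoff N-gram cache over orders 2..12 (illustrative sketch)."""

    def __init__(self, orders=range(2, 13)):
        self.orders = list(orders)
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def predict(self, context):
        # back off from the longest matching order to the shortest
        for n in sorted(self.orders, reverse=True):
            ctx = tuple(context[-(n - 1):])
            if len(ctx) == n - 1 and ctx in self.counts[n]:
                dist = self.counts[n][ctx]
                total = sum(dist.values())
                return {tok: c / total for tok, c in dist.items()}
        return None  # no match at any order: fall back to the model alone

    def update(self, tokens):
        for n in self.orders:
            for i in range(len(tokens) - n + 1):
                ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
                self.counts[n][ctx][nxt] += 1

cache = NGramCache()
chunk = [1, 2, 3, 1, 2, 3]
p_cold = cache.predict(chunk)   # pass 1 scores against the cache as it stands
cache.update(chunk)             # tokens enter the cache only after scoring
p_warm = cache.predict([1, 2])  # a later context now backs off into order 3
```

On a cold cache `predict` returns nothing, which matches the `cold_cache_chunks` idea of early chunks contributing little cache signal.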
Architecture
Parallel Muon
Parallel Muon optimizer with parameter banking and batched Newton-Schulz orthogonalization.
parameters: {"layers":11,"dimensions":512}
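The core of Muon is a Newton-Schulz iteration that approximately orthogonalizes each gradient matrix; "parameter banking" here is read as stacking same-shape matrices so one batched iteration serves many layers at once. The sketch below uses the quintic coefficients from Keller Jordan's Muon reference implementation but is otherwise illustrative, not the PR's code.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize the last two dims of a banked stack."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from Muon
    X = G / (np.linalg.norm(G, axis=(-2, -1), keepdims=True) + eps)
    tall = G.shape[-2] > G.shape[-1]
    if tall:                            # iterate on the smaller Gram matrix
        X = X.swapaxes(-2, -1)
    for _ in range(steps):
        A = X @ X.swapaxes(-2, -1)
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.swapaxes(-2, -1) if tall else X

# bank all 11 layers' (64, 64) gradient blocks into one batched call
bank = np.random.default_rng(0).normal(size=(11, 64, 64))
O = newton_schulz(bank)  # each O[i] is approximately orthogonal
```

Batching over the leading axis is what makes the "parallel" part cheap: one fused matmul sequence instead of eleven small ones.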
BigramHash
Bigram hash feature module.
parameters: {"size":1024}
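A hypothetical sketch of what a size-1024 bigram hash feature looks like: each (previous, current) token pair is hashed into one of 1024 buckets indexing a small learned table, and the looked-up vector is added to the regular token embedding. The mixing constants and table shape are assumptions.

```python
import numpy as np

def bigram_bucket(prev_tok, cur_tok, size=1024):
    """Hash a token pair into one of `size` buckets (constants illustrative)."""
    h = (prev_tok * 1_000_003 + cur_tok * 8191) & 0xFFFFFFFF
    return h % size

rng = np.random.default_rng(0)
table = rng.normal(0.0, 0.02, size=(1024, 512))  # 1024 buckets, d_model=512

tokens = [5, 17, 5, 17, 99]
buckets = [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
feats = table[buckets]  # one extra feature vector per position >= 1
```

Collisions are tolerated by design: with only 1024 buckets the table stays tiny, which matters for the ~15.85 MB artifact budget.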
Gated Attention
Attention mechanism with gating.
parameters: null
Value Residual
Residual value pathway in the model.
parameters: null
XSA
XSA4 attention/sequence module.
parameters: {"variant":"XSA4"}
SmearGate
SmearGate component used in the architecture.
parameters: null
U-Net skip connections
U-Net style skip connections.
parameters: null
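One common transformer reading of U-Net skips (the exact wiring here is an assumption): activations from the first half of the layer stack are saved and added back, in mirrored order, to the second half.

```python
def unet_forward(x, layers):
    """Run a layer stack with U-Net style mirrored skip connections."""
    half = len(layers) // 2
    saved = []
    for i, layer in enumerate(layers):
        if i < half:
            saved.append(x)        # stash encoder-half input
        elif saved:
            x = x + saved.pop()    # add skip from the mirrored layer
        x = layer(x)
    return x

# toy demo: four "layers" that each double their input
out = unet_forward(1.0, [lambda v: v * 2] * 4)
```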
Partial RoPE
Partial rotary positional embeddings, applied to 16 of 64 head dimensions.
parameters: {"16/64":true}
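Reading the 16/64 parameter as "rotate the first 16 of 64 head dims, pass the rest through", a minimal sketch looks like this (the frequency schedule is an assumed standard form, not necessarily the PR's):

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first rot_dims of x's last axis."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # assumed frequency schedule
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)

v = np.ones(64)
out0 = partial_rope(v, pos=0)  # rotation by zero leaves the input unchanged
out5 = partial_rope(v, pos=5)  # rotations preserve the vector's norm
```

Leaving 48 of 64 dims unrotated keeps a position-independent channel alongside the positional one.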
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: {"mlp_multiplier":"3x","power":2,"slope":0.5}
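With slope 0.5 and power 2, one natural reading of "LeakyReLU squared" is a sign-preserving square of the leaky output (the sign handling is an assumption; a ReLU²-style variant would drop it):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5, power=2):
    """LeakyReLU followed by a sign-preserving power (assumed form)."""
    y = np.where(x >= 0, x, slope * x)
    return np.sign(y) * np.abs(y) ** power

out = leaky_relu_sq(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
# -> [-1.0, -0.25, 0.0, 1.0, 4.0]
```

The 3x multiplier means the hidden width is 3 * d_model (1536 at d_model 512), versus the conventional 4x.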
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
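The EMA half of the averaging is a one-line update with the record's decay of 0.997; SWA additionally keeps a uniform running mean over checkpoints. Both helpers below are sketches over plain parameter dicts.

```python
def ema_update(avg, new, decay=0.997):
    """Exponential moving average of parameters (decay from the record)."""
    return {k: decay * avg[k] + (1 - decay) * new[k] for k in avg}

def swa_update(mean, new, n):
    """Uniform running mean after n previously averaged checkpoints (SWA)."""
    return {k: (mean[k] * n + new[k]) / (n + 1) for k in mean}

ema = {"w": 1.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 0.0})  # ema["w"] decays toward 0.997**3

swa = {"w": 0.0}
swa = swa_update(swa, {"w": 1.0}, 1)   # mean of the two checkpoints
```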
Quantization
GPTQ-lite
bits: 6
scope: model weights
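This is not the PR's GPTQ-lite, just the basic shape of 6-bit symmetric per-row quantization it builds on: levels in [-31, 31] with one scale per output row. GPTQ proper additionally propagates each quantized column's error into the remaining weights.

```python
import numpy as np

def quantize_rows(W, bits=6):
    """Symmetric per-row quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax, qmax).astype(np.int8)
    return q, scale

W = np.random.default_rng(0).normal(size=(8, 16)).astype(np.float32)
q, s = quantize_rows(W)
W_hat = q * s   # dequantized weights; per-element error <= scale / 2
```

The quantized integers plus per-row scales are what would then be zstd-compressed into the ~15.85 MB artifact.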
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: null
eval_length: 65536
Novel Contributions
- Two-pass evaluation with order-12 N-gram backoff rescoring
- Entropy-adaptive alpha blending for N-gram/model interpolation
- Backward-looking N-gram cache updated only after scoring
- Parallel Muon optimization with parameter banking
- Large hash-based N-gram cache over validation tokens
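The entropy-adaptive blending above can be sketched as follows: the more uncertain the base model is, the more weight the N-gram distribution receives. The entropy-to-alpha mapping shown is an assumed form, not the PR's exact schedule.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def blend(p_model, p_ngram, alpha_max=0.5, h_scale=1.0):
    """Interpolate model and N-gram distributions with entropy-driven alpha."""
    h = entropy(p_model)
    alpha = alpha_max * (1 - math.exp(-h / h_scale))  # assumed schedule
    return [(1 - alpha) * pm + alpha * pn for pm, pn in zip(p_model, p_ngram)]

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
ngram = [0.0, 1.0, 0.0, 0.0]
p1 = blend(confident, ngram)  # low entropy: N-gram barely moves the model
p2 = blend(uncertain, ngram)  # high entropy: N-gram pulls much harder
```

Because alpha interpolates two normalized distributions, the blend stays a valid distribution without renormalization.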