PR #754

open

Non-Record: 11L Parallel Muon + LeakyReLU² MLP3x + Legal TTT (val_bpb 1.1253)

by aryanbhosale

val_bpb

1.1253

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

~15 MB

Training Techniques

Optimizer

Parallel Muon

weight_decay: 0.04

momentum: 0.92

other_params: {"momentum_schedule":"0.92→0.99 over 1500 steps","newton_schulz_steps":5,"parameter_banking":true,"async_reduce_scatter_all_gather":true}
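
The optimizer settings above can be illustrated with a small pure-Python sketch. It is not the PR's code. The momentum ramp is assumed to be linear, since only the endpoints and the 1500-step duration are given. The quintic Newton-Schulz coefficients are the ones used in the public Muon reference implementation, and it is an assumption that this PR uses the same values. `muon_momentum`, `newton_schulz`, `matmul` and `transpose` are hypothetical helper names.

```python
import math

def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given step. The linear ramp is an assumption."""
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return (1.0 - frac) * start + frac * end

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=5):
    """Approximately orthogonalize a 2-D update matrix (list of rows)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed: Muon reference coefficients
    tall = len(g) > len(g[0])
    x = transpose(g) if tall else [row[:] for row in g]
    # Normalize by the Frobenius norm so all singular values are at most 1.
    norm = math.sqrt(sum(v * v for row in x for v in row)) + 1e-7
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(xxt, xxt2)]
        bx = matmul(poly, x)
        x = [[a * u + w for u, w in zip(r1, r2)] for r1, r2 in zip(x, bx)]
    return transpose(x) if tall else x
```

With five steps the singular values of the result land in a band around 1 rather than exactly at 1, which is the usual trade-off of this iteration.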

Architecture

MLP3x

3x expansion MLP with LeakyReLU(0.5)^2 activation

parameters: {"hidden_dim":1536}
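
A minimal sketch of the activation named above, assuming the square is applied to the output of a leaky ReLU with negative slope 0.5, so negative inputs yield small positive outputs. `leaky_relu_sq` is a hypothetical name. If the 3x expansion is relative to the model width, a hidden dimension of 1536 would imply a width of 512, which the summary does not state directly.

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU followed by squaring (assumed order of operations)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```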

SmearGate

Additional gating mechanism in the architecture

parameters: null

BigramHash

Bigram hash feature module

parameters: {"size":1536,"dim":128}
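
The summary does not say how bigrams are hashed. As a rough illustration only, a bigram can be mapped into one of the 1536 buckets and the bucket used to index a 1536 x 128 embedding table (196,608 entries). The multiplicative hash and its constant below are hypothetical choices, not taken from the PR.

```python
def bigram_bucket(prev_id, cur_id, num_buckets=1536, multiplier=1000003):
    """Map a (previous token, current token) pair to a table row.

    The hash function is an illustrative assumption.
    """
    return (prev_id * multiplier + cur_id) % num_buckets
```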

Value Residual

Caches V from layer 0 and blends via learned lambda

parameters: null
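
One way to read the blend described above is a convex combination of the current layer's V and the cached layer-0 V. The sketch below assumes that form. The PR may parameterize lambda differently (for example through a sigmoid, or with the weights swapped), and `blend_values` is a hypothetical name.

```python
def blend_values(v_current, v_first, lam):
    """Blend this layer's value vector with the one cached from layer 0."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_current, v_first)]
```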

Gated Attention

Per-head sigmoid gating for attention outputs

parameters: null
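
A minimal sketch of per-head sigmoid gating, assuming one gate logit per head that scales that head's output vector. Whether the logits are input-dependent or static learned parameters is not stated in the summary. `gate_heads` is a hypothetical name.

```python
import math

def gate_heads(head_outputs, gate_logits):
    """Scale each head's output by sigmoid(logit) for that head."""
    gated = []
    for out, logit in zip(head_outputs, gate_logits):
        g = 1.0 / (1.0 + math.exp(-logit))
        gated.append([g * v for v in out])
    return gated
```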

XSA

Exclusive self-attention applied to the last 4 layers

parameters: {"layers":4}

Partial RoPE

Rotary positional embeddings applied to a subset of head dimensions

parameters: {"dimensions":"16/64"}
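
The "16/64" setting can be sketched as rotating 16 of the 64 dimensions of each head and passing the remaining 48 through unchanged. The sketch assumes the first 16 dimensions are the rotated ones, a half-split pairing of dimensions, and a base of 10000. None of these details are given in the summary, and `partial_rope` is a hypothetical name.

```python
import math

def partial_rope(head, position, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first rot_dims entries of one head vector."""
    out = list(head)
    half = rot_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dims)
        x1, x2 = head[i], head[i + half]
        out[i] = x1 * math.cos(theta) - x2 * math.sin(theta)
        out[i + half] = x1 * math.sin(theta) + x2 * math.cos(theta)
    return out
```

Because each pair undergoes a plain 2-D rotation, the vector norm is preserved and position 0 leaves the vector unchanged.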

tied embeddings

Input and output embeddings are tied

parameters: null

Initialization

OrthoInit

Orthogonal initialization

Weight Averaging

EMA

parameters: {"decay":0.997}
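
For reference, the standard EMA update with this decay looks as follows. A decay of 0.997 corresponds to an averaging horizon of roughly 1 / (1 - 0.997), or about 333 steps. `ema_update` is a hypothetical name, and parameters are shown as a flat list for simplicity.

```python
def ema_update(shadow, params, decay=0.997):
    """Return the new EMA weights after one training step."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```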

SWA

parameters: {"interval":"every 50 steps when scale < 0.2"}
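
A rough sketch of the SWA rule above, assuming "scale" means the learning-rate multiplier and that snapshots are combined with a uniform running mean. Both are assumptions, and the class and method names are hypothetical.

```python
class SWA:
    """Uniform running average of snapshots taken late in training."""

    def __init__(self):
        self.avg = None
        self.count = 0

    def maybe_update(self, step, lr_scale, params, interval=50, threshold=0.2):
        # Skip unless the schedule is in its low-LR tail and the step is on the interval.
        if lr_scale >= threshold or step % interval != 0:
            return False
        self.count += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            self.avg = [a + (p - a) / self.count for a, p in zip(self.avg, params)]
        return True
```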

Quantization

GPTQ-lite

bits: 6

scope: per-row weights

STE QAT

bits: 6

scope: all weights
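
The per-row int6 scheme can be sketched as symmetric quantization with one scale per weight row, plus a small search over clip values that keeps the candidate with the lowest reconstruction error. The integer range [-31, 31] and the five clip fractions are assumptions, since the summary names neither, and fractions of the row maximum stand in here for true percentiles. During QAT, a straight-through estimator would use the dequantized weights in the forward pass while passing gradients to the full-precision weights. That part is not shown.

```python
def quantize_row(row, clip, levels=31):
    """Symmetric quantization of one row to integers in [-levels, levels]."""
    scale = clip / levels if clip > 0 else 1.0
    q = [max(-levels, min(levels, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

def search_clip(row, fractions=(1.0, 0.999, 0.995, 0.99, 0.98)):
    """Try a few clip candidates and keep the one with the lowest squared error."""
    max_abs = max(abs(w) for w in row)
    best = None
    for f in fractions:
        q, scale = quantize_row(row, f * max_abs)
        err = sum((w - d) ** 2 for w, d in zip(row, dequantize_row(q, scale)))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]
```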

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"stride":64,"chunk_size":32000}
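
Sliding-window evaluation with a stride of 64 is commonly implemented so that each window re-reads earlier tokens as context but only the tokens not yet scored contribute to the loss. The sketch below follows that common pattern. The context length of 1024 is a placeholder because the summary does not give one, and `sliding_windows` is a hypothetical name.

```python
def sliding_windows(n_tokens, context=1024, stride=64):
    """Return (begin, end, score_from) triples.

    Each window covers tokens [begin, end); only [score_from, end) is scored,
    so every token is scored exactly once.
    """
    windows = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```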

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"chunk_size":32000}
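
The score-first ordering can be shown with a toy one-parameter model: each chunk is scored with weights that have not yet been trained on it, and only afterwards does the model take SGD-with-momentum steps on that chunk. The scalar model and squared-error loss are stand-ins for the real network and token loss. The function names are hypothetical.

```python
def score(theta, chunk):
    """Mean squared error of a constant predictor (toy stand-in for the eval loss)."""
    return sum((x - theta) ** 2 for x in chunk) / len(chunk)

def score_first_ttt(chunks, theta=0.0, lr=0.002, momentum=0.9, epochs=3):
    losses, velocity = [], 0.0
    for chunk in chunks:
        # 1) Record the score before any update that has seen this chunk.
        losses.append(score(theta, chunk))
        # 2) Only then adapt on the same chunk.
        for _ in range(epochs):
            grad = sum(2.0 * (theta - x) for x in chunk) / len(chunk)
            velocity = momentum * velocity + grad
            theta -= lr * velocity
    return losses, theta
```

The reported metric would be aggregated from `losses`, so no chunk's score benefits from training on that chunk.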

LR Schedule

warmdown

parameters: {"warmdown_steps":3500}
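
A warmdown of 3500 steps is sketched here as a learning-rate multiplier that stays at 1 and then decays linearly to 0 over the final 3500 steps. The linear shape and the `total_steps` argument are assumptions, since the summary gives only the warmdown length. `lr_scale` is a hypothetical name.

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Learning-rate multiplier with an assumed linear warmdown to zero."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return max(remaining, 0) / warmdown_steps
```

Under this reading, and if the SWA condition "scale < 0.2" refers to the same multiplier, SWA snapshots would fall within the last 700 steps of training.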

Regularization

weight decay

parameters: {"value":0.04}

Novel Contributions

Parallel Muon with parameter banking and batched Newton-Schulz updates
MLP with 3x expansion and LeakyReLU(0.5)^2 activation
Legal score-first test-time training (TTT) with score-before-update enforcement
EMA plus SWA model averaging
GPTQ-lite int6 quantization with per-row 5-percentile clip search
Flash Attention 3 and torch.compile(fullgraph=True) without DDP