PR #1298
openRecord: Polar Express NS + SLOT + MuonEq-R + XSA-all — 1.1043 BPB (3-seed mean)
by Omrigotlieb
val_bpb
1.1043
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.82 MB
Training Techniques
Architecture
BigramHash
Bigram vocabulary embedding/hash component used in the model.
parameters: {"vocab_size":1536}
XSA
Exclusive self-attention applied across all layers.
parameters: {"layers":11}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
MLP3x
Three-layer MLP stack with LeakyReLU squared activation.
parameters: {"activation":"LeakyReLU²"}
VE128
Value residual enhancement on selected layers.
parameters: {"dimensions":128,"layers":[9,10]}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"muon_backend_steps":4}
Muon
weight_decay: null
momentum: null
other_params: {"polar_express_steps":4,"muon_eq_r":true}
Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"every":50}
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
SLOT
parameters: {"steps":8,"learning_rate":0.005}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Quantization
GPTQ-lite
bits: 6
scope: all
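As a deliberately simplified stand-in for the quantizer (real GPTQ adds Hessian-based error compensation, which a "lite" variant presumably trims), per-channel symmetric 6-bit round-to-nearest looks like:

```python
import torch

def quantize_6bit(w: torch.Tensor):
    qmax = 2 ** (6 - 1) - 1                      # symmetric int6 grid: [-31, 31]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q.to(torch.int8), scale               # int6 codes stored in int8

w = torch.randn(768, 768)
q, scale = quantize_6bit(w)
w_hat = q.float() * scale                        # dequantized weights
```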
Compression
lzma
level: 9
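The final artifact compression as listed, over the serialized checkpoint (file names are placeholders):

```python
import lzma

with open("checkpoint.bin", "rb") as f:
    raw = f.read()
with lzma.open("checkpoint.bin.xz", "wb", preset=9) as f:
    f.write(raw)
```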
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Novel Contributions
- Polar Express Newton-Schulz with per-iteration minimax-optimal polynomials
- SLOT eval-time delta optimization
- MuonEq-R row-normalized gradient reparameterization
- XSA extended to all 11 layers
- 3-seed mean validation result of 1.1043 BPB