PR #549
Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)
by abaybektursun
val_bpb: 1.1194
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.95 MB
Training Techniques
Architecture
MLP3x
Three-layer MLP stack using the LeakyReLU(0.5)² activation (leaky ReLU with negative slope 0.5, then squared).
parameters: {"layers":3}
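A minimal PyTorch sketch of the activation, under one direct reading of the name: LeakyReLU with negative slope 0.5 followed by squaring. Whether the sign is restored after squaring is not stated in the record, so treat this as an assumption.

```python
import torch
import torch.nn.functional as F


class LeakyReLUSquared(torch.nn.Module):
    """LeakyReLU(negative_slope=0.5) followed by squaring.

    A drop-in replacement for the usual relu(x)**2 activation; the
    plain-square (sign-discarding) form is an assumption.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.leaky_relu(x, negative_slope=0.5).square()
```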
BigramHash
Hashed bigram token-feature embedding: each (previous, current) token pair is hashed into a fixed-size embedding table.
parameters: {"size":1536}
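A sketch of one plausible BigramHash implementation: hash the (previous, current) token pair into a 1536-row table and look up an embedding. The multiplier constant and the position-0 handling are illustrative assumptions, not details from the record.

```python
import torch
import torch.nn as nn


class BigramHashEmbedding(nn.Module):
    """Hashed bigram feature embedding (sketch).

    The (previous, current) token pair is hashed into a table of `size`
    rows; the 1000003 mixing constant and zero-fill at position 0 are
    assumptions for illustration.
    """

    def __init__(self, size: int = 1536, dim: int = 64):
        super().__init__()
        self.size = size
        self.emb = nn.Embedding(size, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, T) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no real predecessor at position 0
        idx = (prev * 1000003 + tokens) % self.size
        return self.emb(idx)
```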
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied partially to a subset of dimensions.
parameters: {"dimensions":16}
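A sketch of partial RoPE: rotate only the first 16 head dimensions and pass the remainder through unchanged. The exact dimension layout and frequency base are assumptions; the record specifies only that 16 dimensions are rotated.

```python
import torch


def partial_rope(x: torch.Tensor, rope_dims: int = 16, base: float = 10000.0):
    """Apply rotary embeddings to the first `rope_dims` head dims only.

    x: (B, H, T, D). The remaining D - rope_dims dims pass through
    unrotated. Frequency base and half-split layout are assumptions.
    """
    B, H, T, D = x.shape
    xr, xp = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = torch.arange(T, dtype=x.dtype)[:, None] * inv_freq[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = xr[..., :half], xr[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, xp], dim=-1)
```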
KV head count
Uses 8 attention heads and 4 KV heads (grouped-query attention: each KV head is shared by 2 query heads).
parameters: {"heads":8,"kv_heads":4}
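A minimal sketch of the 8-query/4-KV grouped-query attention shape: expand each KV head to serve two query heads, then run standard scaled-dot-product attention. Causal masking is an assumption.

```python
import torch
import torch.nn.functional as F


def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch: q has n_heads, k/v have n_kv_heads.

    Each KV head is shared by n_heads // n_kv_heads query heads via
    repeat_interleave; causal masking is assumed.
    """
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)  # (B, n_kv, T, Dh) -> (B, n_heads, T, Dh)
    v = v.repeat_interleave(rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```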
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
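The momentum warmup listed above (0.92 to the final 0.99 over 1500 steps) can be sketched as a simple schedule; linear interpolation is an assumption, since the record gives only the endpoints and step count.

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Warm Muon momentum from `start` to `end` over `warmup_steps`.

    Linear interpolation is an assumption; the record lists only the
    endpoints (0.92 -> 0.99) and the step count (1500).
    """
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```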
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50,"tight":true}
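The EMA half of the weight-averaging stack is a standard exponential moving average over parameter tensors; a minimal sketch with the listed decay of 0.997 (the SWA side, averaging snapshots every 50 steps, is not reproduced here):

```python
import torch


@torch.no_grad()
def ema_update(ema_params, model_params, decay: float = 0.997):
    """One EMA step over weight tensors: ema <- decay*ema + (1-decay)*w."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```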
Quantization
GPTQ-lite
bits: 6
scope: all
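"GPTQ-lite" is not specified further in the record; as a baseline, here is plain symmetric per-row round-to-nearest int6 quantization. A GPTQ-style method would presumably add error compensation on top of this, which is not reproduced here.

```python
import torch


def quantize_int6(w: torch.Tensor):
    """Symmetric per-row round-to-nearest int6 quantization (RTN baseline).

    This is only the baseline; the record's 'GPTQ-lite' presumably adds
    GPTQ-style error compensation, which is not shown.
    """
    qmax = 31  # signed 6-bit range is [-32, 31]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -32, 31).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover approximate float weights from int6 codes and per-row scales."""
    return q.to(scale.dtype) * scale
```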
Compression
lzma
level: null
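Since the record leaves the lzma level unset, a minimal round-trip sketch of compressing the quantized codes with Python's stdlib `lzma` (the preset and the one-code-per-byte packing are assumptions; tighter 6-bit packing is possible):

```python
import lzma

import numpy as np

# Int6 codes stored one per int8 byte for simplicity; compress losslessly.
codes = np.random.randint(-32, 32, size=4096, dtype=np.int8)
blob = lzma.compress(codes.tobytes(), preset=9)  # preset is an assumption
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
assert (restored == codes).all()  # LZMA round-trip is exact
```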
Evaluation
sliding window eval
parameters: {"stride":64}
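Sliding-window evaluation with stride 64 means each window scores only its final 64 tokens, reusing earlier tokens as context. A sketch of the span generator (the context length here is an assumed example; the record specifies only the stride):

```python
def sliding_eval_spans(n_tokens: int, context: int = 1024, stride: int = 64):
    """Yield (window_start, score_start, score_end) spans.

    Each window scores only its final `stride` tokens, with up to
    `context` tokens of left context. `context=1024` is an assumption.
    """
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        window_start = max(0, score_end - context)
        yield window_start, score_start, score_end
        score_start = score_end
```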
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"epochs":3,"learning_rate":0.002,"momentum":0.9,"freeze_blocks":0,"gradient_clip":1,"legal":true}
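A sketch of what "legal score-first TTT" plausibly means: each chunk is scored under `torch.inference_mode()` before the model adapts on it, so the reported loss never comes from weights that have already seen that chunk. The `loss_fn(model, chunk)` interface is an assumption; the listed hyperparameters (3 epochs, clip 1, no frozen blocks) map onto the loop below.

```python
import torch


def score_first_ttt(model, loss_fn, chunks, optimizer, epochs=3, clip=1.0):
    """Score each chunk BEFORE adapting on it ('legal' score-first TTT sketch).

    `loss_fn(model, chunk)` returning a mean per-token loss is an assumed
    interface. Scoring happens under inference_mode; adaptation follows.
    """
    total, n = 0.0, 0
    for chunk in chunks:
        with torch.inference_mode():  # score with pre-adaptation weights
            total += loss_fn(model, chunk).item() * chunk.numel()
        n += chunk.numel()
        for _ in range(epochs):  # then adapt on the already-scored chunk
            optimizer.zero_grad()
            loss_fn(model, chunk).backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
    return total / n
```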
LR Schedule
cosine decay
parameters: {"warmdown_steps":3500}
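A sketch of the schedule: hold the base LR, then cosine-decay to zero over the final 3500 steps. Holding constant before the warmdown is an assumption; the record lists only the schedule type and warmdown length.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3500) -> float:
    """Constant LR, then cosine decay to zero over the last `warmdown_steps`.

    The constant phase before the warmdown is an assumption.
    """
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```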
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
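A sketch of the layerwise scale: multiply each layer's normalized output by 1/sqrt(layer + 1), damping deeper layers' contributions. Applying the factor to the LayerNorm output (rather than, say, its gain) is an assumption; the record gives only the formula.

```python
import torch
import torch.nn as nn


class DepthScaledLayerNorm(nn.Module):
    """LayerNorm whose output is multiplied by 1/sqrt(layer + 1).

    Applying the scale to the LN output is an assumption; the record
    specifies only the formula 1/sqrt(layer+1).
    """

    def __init__(self, dim: int, layer: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = (layer + 1) ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x) * self.scale
```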
Other
Parameter Banking with batched Newton-Schulz orthogonalization and async reduce-scatter/all-gather to speed up training.
parameters: {"step_time_ms":83.4}
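The Newton-Schulz orthogonalization at the heart of this step can be sketched with the quintic iteration from the public Muon implementation; the batching and async reduce-scatter/all-gather parts of Parameter Banking are distributed-systems plumbing and are not reproduced here.

```python
import torch


def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz iteration approximately orthogonalizing G.

    Coefficients follow the public Muon implementation; Parameter
    Banking's batching/async communication is not shown.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # normalize so singular values are < 1
    transposed = X.size(0) > X.size(1)
    if transposed:  # iterate on the smaller Gram matrix side
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```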
Novel Contributions
- LeakyReLU(0.5)^2 activation replacing standard relu^2
- Legal score-first test-time training under torch.inference_mode()
- Parallel Muon / Parameter Banking optimizer stack
- All-block-unfrozen TTT adaptation (freeze=0) with 3 epochs
- GPTQ-lite int6 quantization with lzma compression