PR #1128

open

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon | val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM

by AnubhavBharadwaaj

val_bpb

1.1154

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

~15.9 MB

Training Techniques

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0,"momentum":0.9,"batch_seqs":32,"grad_clip":1}
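The "score-first" ordering can be illustrated with a toy sketch: each chunk is scored with the model as it stood before that chunk was seen, and only afterwards is the model adapted on it. The Laplace-smoothed unigram counter below is a stand-in chosen for illustration; it is not the PR's model, and the count update replaces the gradient steps (learning rate, epochs, momentum, gradient clipping) listed in the parameters above.

```python
import math
from collections import Counter

def score_first_ttt(tokens, chunk_tokens, vocab_size):
    """Mean negative log-likelihood (nats/token) with score-then-adapt ordering.

    Toy stand-in: a Laplace-smoothed unigram model replaces the transformer,
    and a count update replaces the gradient-based adaptation.
    """
    counts = Counter()
    seen = 0
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) Score the chunk using only what was learned from earlier chunks.
        for t in chunk:
            p = (counts[t] + 1) / (seen + vocab_size)
            total_nll -= math.log(p)
        # 2) Only now adapt on the chunk that has already been scored.
        counts.update(chunk)
        seen += len(chunk)
    return total_nll / len(tokens)

# The first chunk is always scored by the unadapted model.
print(score_first_ttt([0, 0, 0, 0, 1, 1, 1, 1], chunk_tokens=4, vocab_size=2))
```

Because step 2 never runs before step 1 for the same chunk, no token contributes to its own score.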

Other

SLOT

Sample-specific language model optimization at test time (SLOT): optimizes a per-batch additive delta on the last hidden layer during evaluation

parameters: {"delta_shape":[1,1,512],"steps":5,"learning_rate":0.003}
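A minimal sketch of the SLOT idea, assuming a frozen output projection and plain gradient descent (the PR does not state which optimizer is applied to the delta). One delta vector is shared across every position in the batch, mirroring the broadcastable `[1,1,512]` shape; the names `hidden`, `W`, and `slot_delta` are hypothetical, and the dimensions are shrunk to 2 for readability.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mean_loss_and_grad(hidden, targets, W, delta):
    """Mean cross-entropy over positions and its gradient w.r.t. delta.

    hidden: list of last-layer vectors, W: frozen output projection (V x d).
    """
    d = len(delta)
    grad = [0.0] * d
    loss = 0.0
    for h, y in zip(hidden, targets):
        x = [hi + di for hi, di in zip(h, delta)]  # delta added to every position
        p = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])
        loss -= math.log(p[y])
        for v, row in enumerate(W):
            err = p[v] - (1.0 if v == y else 0.0)
            for j in range(d):
                grad[j] += err * row[j]
    n = len(hidden)
    return loss / n, [g / n for g in grad]

def slot_delta(hidden, targets, W, steps=5, lr=0.003):
    """Optimize one shared delta for a batch; model weights stay frozen."""
    delta = [0.0] * len(W[0])
    for _ in range(steps):
        _, grad = mean_loss_and_grad(hidden, targets, W, delta)
        delta = [dj - lr * gj for dj, gj in zip(delta, grad)]  # plain SGD (assumption)
    return delta

hidden = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
delta = slot_delta(hidden, [0, 0], W)
print(delta)
```

The delta is reset to zero for each batch, so nothing learned on one batch carries over to the next in this sketch.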

Architecture

LeakyReLU²

Squared LeakyReLU (LeakyReLU²) activation used in the model's MLP

parameters: {"mlp_layers":3}
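As an illustration, a squared LeakyReLU can be written as a LeakyReLU followed by an elementwise square. The negative slope is not given in this summary, so the default of 0.5 below is an assumption.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by squaring; negative_slope is an assumed value."""
    y = x if x >= 0 else negative_slope * x
    return y * y

print([leaky_relu_squared(v) for v in (-2.0, 0.0, 2.0)])
```

Unlike ReLU², negative inputs still produce a nonzero output (and gradient), scaled by the square of the slope.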

BigramHash

Bigram hash embedding component

parameters: {"vocab_size":1536}
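A bigram hash embedding typically maps each (previous token, current token) pair to one of a fixed number of buckets and looks up a learned vector for that bucket. The sketch below assumes `vocab_size` 1536 is the bucket count; the mixing constant, the padding token for the first position, and the function names are all hypothetical.

```python
def bigram_bucket(prev_token, token, num_buckets=1536):
    """Hash a token pair into a bucket index (hypothetical mixing function)."""
    return (prev_token * 1000003 + token) % num_buckets

def bigram_ids(tokens, num_buckets=1536, pad_token=0):
    """Bucket index per position; position 0 pairs with an assumed pad token."""
    prev = [pad_token] + list(tokens[:-1])
    return [bigram_bucket(p, t, num_buckets) for p, t in zip(prev, tokens)]

print(bigram_ids([17, 42, 42, 9001]))
```

Distinct pairs can collide in the same bucket; that is the trade-off that keeps the table small.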

XSA

XSA attention-related modification

parameters: {"last_n":4}

Partial RoPE

Partial rotary positional embeddings

parameters: {"dimensions":16}
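Partial RoPE rotates only a subset of each head's dimensions and passes the rest through unchanged. The sketch below rotates the first 16 dimensions of a single head vector; the pairing convention (dimension `i` with `i + 8`), the frequency base of 10000, and the head size of 64 in the usage line are assumptions.

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first rot_dims entries of one head vector."""
    half = rot_dims // 2
    out = list(vec)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out  # entries from rot_dims onward are untouched

head = [float(i + 1) for i in range(64)]  # hypothetical head dimension of 64
rotated = partial_rope(head, position=3)
print(rotated[:4], rotated[16:20])
```

The rotation preserves the vector norm, and the unrotated dimensions carry no positional signal.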

VE128

Value residual enhancement module

parameters: {"dim":128,"layers":[9,10]}

Regularization

LN scale

parameters: null

Weight Averaging

EMA + Tight SWA

parameters: {"decay":0.997,"swa_every":50}
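The two averaging schemes can be sketched side by side: an exponential moving average updated every step with decay 0.997, and a uniform average of snapshots taken every 50 steps. How the PR combines the two, and over which window the "tight" SWA runs, is not stated here, so this sketch simply returns both; `average_weights` and the flat-list weight format are hypothetical.

```python
def average_weights(trajectory, decay=0.997, swa_every=50):
    """Return (ema, swa) for a list of per-step weight vectors.

    ema: exponential moving average updated at every step.
    swa: uniform mean of snapshots taken every swa_every steps (None if none).
    """
    ema = list(trajectory[0])
    swa_sum = [0.0] * len(ema)
    swa_count = 0
    for step, w in enumerate(trajectory, start=1):
        ema = [decay * e + (1.0 - decay) * x for e, x in zip(ema, w)]
        if step % swa_every == 0:
            swa_sum = [s + x for s, x in zip(swa_sum, w)]
            swa_count += 1
    swa = [s / swa_count for s in swa_sum] if swa_count else None
    return ema, swa

# Toy trajectory: a single weight that equals the step number.
ema, swa = average_weights([[float(i)] for i in range(1, 101)])
print(ema, swa)
```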

Quantization

GPTQ-lite

bits: 6

scope: all
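For a sense of what 6-bit storage implies, here is a round-to-nearest symmetric quantizer with one scale per row. This is a simplification: GPTQ-style methods additionally compensate for quantization error, and the summary does not describe what "GPTQ-lite" does, so none of that is shown. The per-row scale is an assumption.

```python
def quantize_row(row, bits=6):
    """Symmetric round-to-nearest quantization of one weight row."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = max(abs(v) for v in row) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

q, scale = quantize_row([0.5, -1.0, 0.25, 0.0])
print(q, dequantize_row(q, scale))
```

Each stored integer lies in [-32, 31], and the reconstruction error per weight is at most half the scale.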

Compression

lzma

level: 6
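Assuming "level: 6" corresponds to the `preset` argument of Python's standard `lzma` module, compressing a serialized artifact looks roughly like this. The payload here is placeholder bytes, not real model weights.

```python
import lzma

def compress_artifact(payload: bytes, level: int = 6) -> bytes:
    """Compress serialized weights with LZMA at the given preset."""
    return lzma.compress(payload, preset=level)

payload = bytes(1000)  # placeholder for quantized weight bytes
blob = compress_artifact(payload)
print(len(payload), len(blob))
```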

Evaluation

stride-based eval

parameters: {"stride":64}
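Stride-based evaluation usually slides a fixed-length context window forward by `stride` tokens and scores only the tokens not already scored, so most tokens are predicted with a long left context. A sketch of the window bookkeeping, with a hypothetical sequence length of 128 in the usage line:

```python
def stride_windows(num_tokens, seq_len, stride=64):
    """Return (window_start, window_end, score_start) triples.

    Tokens in [score_start, window_end) are scored in that window; earlier
    tokens in the window serve only as context. Assumes stride <= seq_len.
    """
    windows = []
    scored_until = 0
    start = 0
    while scored_until < num_tokens:
        end = min(start + seq_len, num_tokens)
        windows.append((start, end, scored_until))
        scored_until = end
        start += stride
    return windows

print(stride_windows(256, seq_len=128))
```

Every token is scored exactly once, and after the first window each scored token has at least `seq_len - stride` tokens of context.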

Optimizer

Muon

weight_decay: 0.04

momentum: 0.99

other_params: {"warmup_start":0.92,"warmup_steps":1500}
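One plausible reading of `warmup_start` and `warmup_steps` is a momentum warmup: the Muon momentum ramps from 0.92 to its final value of 0.99 over the first 1500 steps. The linear ramp below is an assumption; the summary does not give the shape.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly interpolate momentum from start to end, then hold at end."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end

print([round(muon_momentum(s), 4) for s in (0, 750, 1500, 5000)])
```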

AdamW

weight_decay: 0.04

momentum: null

other_params: {"eps":0.00001}

LR Schedule

warmdown

parameters: {"warmdown_steps":3500}
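A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The sketch assumes a linear decay to zero over the last 3500 steps; `total_steps` is a hypothetical argument, since the total step count is not listed.

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Multiplier on the base learning rate: 1.0, then a linear ramp to 0."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return max(remaining, 0) / warmdown_steps

print([lr_scale(s, total_steps=7000) for s in (0, 3500, 5250, 7000)])
```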

Novel Contributions

First SLOT-based entry in Parameter Golf
Per-batch test-time optimization of a 512-dimensional delta at the last hidden layer
Combination of SLOT with legal score-first TTT
Parallel Muon-based training with the existing PR #549 base architecture
Record-setting 3-seed mean val_bpb of 1.1154
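For readers unfamiliar with the metric, val_bpb is taken here to mean validation bits per byte: the total negative log-likelihood, converted from nats to bits, divided by the number of raw bytes in the validation text. A sketch of the conversion, with hypothetical argument names:

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed token-level loss in nats to bits per byte of text."""
    return total_nll_nats / (math.log(2) * total_bytes)

# A model spending exactly one bit per byte on 8 bytes of text:
print(bits_per_byte(8 * math.log(2), 8))
```

Normalizing by bytes rather than tokens makes scores comparable across tokenizers.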