PR #1523

Status: open

Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean)

by EthanYangTW
val_bpb: 1.0778
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
Triple depth recurrence: layers 3, 4, and 5 are each run three times once looping is enabled partway through training, expanding 11 physical layers into 17 virtual layers (11 + 3×2 extra passes).
parameters: {"physical_layers":11,"virtual_layers":17,"loop_layers":[3,4,5],"activation_start":35}
BigramHash
Eval-time hash embedding using a bigram hash over prefix tokens, with a zero-initialized learned embedding trained during TTT.
parameters: {"vocab_size":16384,"embedding_dim":512}
LeakyReLU
The fused MLP applies LeakyReLU followed by squaring as the activation in the MLP path.
parameters: {"negative_slope":0.5}
Optimizer
Parallel Muon
weight_decay: 0.095
momentum: 0.97
other_params: {"lr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
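
GPTQ itself is more involved; as a sketch of the embedding side, here is plain symmetric per-row int8 quantization (the exact scheme is an assumption, the record only states that embeddings are stored at 8 bits):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-row int8 quantization of a 2-D embedding matrix."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale
```
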
Compression
brotli
level: null
Evaluation
sliding window eval
parameters: null
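
A sketch of sliding-window evaluation: each window scores only the targets not covered by the previous window, so every token keeps long left context (window and stride sizes here are illustrative, not from the record):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, n_bytes, window=2048, stride=1024):
    """Score a long token sequence with overlapping windows, counting loss
    only on each window's unscored tail, then normalize to bits per byte."""
    N = tokens.size(0)
    nll, done = 0.0, 0  # `done` = number of target positions already scored
    while done < N - 1:
        end = min(done + stride, N - 1)
        start = max(0, end - window)       # left edge supplies context
        chunk = tokens[start : end + 1]
        logits = model(chunk[:-1].unsqueeze(0)).squeeze(0)
        n_new = end - done                 # only the unscored tail positions
        nll += F.cross_entropy(logits[-n_new:], chunk[-n_new:], reduction="sum").item()
        done = end
    return nll / math.log(2) / n_bytes     # nats -> bits, normalized per byte
```
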
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.01}
LR Schedule
warmdown
parameters: {"warmdown_steps":66.7}
Regularization
logit softcap
parameters: {"value":30}

Novel Contributions

  • Parameter banking with parallel Muon across 4 contiguous banks
  • Batched Newton-Schulz optimizer step for faster training (see the sketch after this list)
  • Fused MLP Triton TMA kernel combining fc, LeakyReLU, and square
  • Muon momentum reduced to 0.97
  • Triple depth recurrence with 17 virtual layers
  • Eval-time BigramHash embedding trained during TTT
  • TTT learning rate tuned to 0.01
  • Score-first TTT compliance
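
For the batched Newton-Schulz step referenced above, a sketch that orthogonalizes a stack of same-shaped gradient matrices with batched matmuls instead of one matrix at a time; the quintic coefficients follow public Muon implementations, and the 4-bank parameter layout is not reproduced here:

```python
import torch

@torch.no_grad()
def batched_newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a (B, m, n) stack of gradient matrices
    via the quintic Newton-Schulz iteration, run in bfloat16 with batched
    matmuls so all B matrices are processed in one optimizer step."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)  # bound spectral norm
    transposed = X.size(-2) > X.size(-1)
    if transposed:
        X = X.mT  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```
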