PR #1514 (open)
Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean)
by dexhunter
val_bpb: 1.0798
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB
Training Techniques
Optimizer
Muon
weight_decay: 0.085
momentum: 0.97
other_params: {"warmup_start":0.92,"warmup_end":0.97,"warmup_steps":1500}
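The momentum settings above imply a warmup from 0.92 to the final 0.97 over 1500 steps. A minimal sketch of that schedule, assuming linear interpolation (the PR lists only the endpoints and step count, not the interpolation shape):

```python
def muon_momentum(step, start=0.92, end=0.97, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`,
    then hold at `end`. Linear interpolation is an assumption; the PR
    only specifies warmup_start, warmup_end, and warmup_steps."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```

The returned value would feed the momentum coefficient of the Muon update at each step.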
Architecture
weight tying
Tied token embeddings.
parameters: null
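Weight tying shares one matrix between the input embedding table and the output (unembedding) projection, halving those parameters. A toy sketch with plain Python lists (the real model's shapes and storage are not specified here):

```python
class TiedLM:
    """Toy skeleton showing tied token embeddings: the embedding table
    and the output projection are literally the same object."""

    def __init__(self, vocab_size, d_model):
        # one shared table: row i is the embedding of token i
        self.embed = [[0.01 * (i + j) for j in range(d_model)]
                      for i in range(vocab_size)]

    def embed_token(self, tok):
        return self.embed[tok]

    def logits(self, hidden):
        # output projection reuses the same rows (transposed use)
        return [sum(h * w for h, w in zip(hidden, row))
                for row in self.embed]
```

Because both directions read from `self.embed`, any update to the table moves the input and output representations together.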
depth recurrence
Loops layers 3-5 twice during training.
parameters: {"layers":[3,5],"repeats":2}
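Depth recurrence reuses a contiguous block of layers without adding parameters. A sketch of the forward pass with the PR's settings, assuming the `[3,5]` range is inclusive (layer callables here are stand-ins for transformer blocks):

```python
def forward_with_recurrence(x, layers, loop_range=(3, 5), repeats=2):
    """Run `layers` in order, but execute the inclusive slice
    layers[loop_range[0]..loop_range[1]] `repeats` times.
    Inclusive indexing is an assumption about the PR's notation."""
    lo, hi = loop_range
    for layer in layers[:lo]:
        x = layer(x)
    for _ in range(repeats):
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:
        x = layer(x)
    return x
```

With 7 layers, the execution order becomes 0, 1, 2, 3, 4, 5, 3, 4, 5, 6: extra depth at inference cost but no extra weights in the artifact.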
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
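Partial RoPE rotates only the first few dimensions of each head vector and passes the rest through unchanged. A sketch with the PR's `dimensions: 16` (the frequency base of 10000 is the common RoPE default, assumed here):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` dims of
    head vector `x`; remaining dims pass through unrotated. The base
    and pairing convention are assumptions (standard RoPE defaults)."""
    out = list(x)
    for k in range(rot_dims // 2):
        theta = pos * base ** (-2 * k / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * k], x[2 * k + 1]
        out[2 * k] = a * c - b * s
        out[2 * k + 1] = a * s + b * c
    return out
```

Dimensions beyond index 15 carry no positional signal, which leaves part of the head free for position-independent features.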
LeakyReLU^2
LeakyReLU^2 activation used in the MLP.
parameters: {"slope":0.5}
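The exact form of "LeakyReLU^2" is not spelled out in the PR; one plausible reading, sketched here as an assumption, is LeakyReLU followed by squaring (analogous to the squared-ReLU activation):

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with the PR's slope=0.5, then squared.
    Squaring the output is an assumed interpretation of 'LeakyReLU^2';
    note it discards the sign of the negative branch."""
    y = x if x > 0 else slope * x
    return y * y
```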
KV head count
Uses 4 KV heads.
parameters: {"kv_heads":4}
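With fewer KV heads than query heads (grouped-query attention), each KV head is shared by a group of query heads, shrinking the KV cache and the attention weights. A sketch of the head mapping; the query-head count of 8 is an illustrative assumption, the PR only fixes `kv_heads: 4`:

```python
def kv_head_for_query(q_head, n_q_heads=8, n_kv_heads=4):
    """Map each query head to its shared KV head under grouped-query
    attention. n_q_heads=8 is an assumed example value."""
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```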
Regularization
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: all weights
int8
bits: 8
scope: embeddings
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"freeze_blocks":0,"chunk_tokens":32768}
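The "legal" property of score-first TTT is an ordering constraint: each chunk's loss is recorded with the current weights before any gradient step on that chunk, so the reported score never benefits from its own tokens. A control-flow sketch, where `model_loss` and `update` are hypothetical callables standing in for the real model and optimizer step:

```python
def score_first_ttt(chunks, model_loss, update, epochs=3):
    """Legal score-first test-time training: score each chunk with the
    current weights BEFORE updating on it. `model_loss` and `update`
    are hypothetical stand-ins; the PR's real loop also uses
    lr=0.005 and 32768-token chunks."""
    scores = []
    for chunk in chunks:
        scores.append(model_loss(chunk))   # score first, pre-update
        for _ in range(epochs):
            update(chunk)                  # then adapt on that chunk
    return scores
```

Later chunks still benefit from adaptation on earlier chunks, which is where the TTT gain comes from.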
Evaluation
sliding window eval
parameters: null
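Sliding-window evaluation scores every token exactly once while giving each scored span a fixed amount of left context. A window-layout sketch; the window and stride sizes are illustrative assumptions, since the PR lists no parameters:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Return (context_start, score_start, score_end) triples so each
    token is scored once with up to `window` tokens of left context.
    window=1024 and stride=512 are assumed example values."""
    windows = []
    start = 0
    while start < n_tokens:
        context_start = max(0, start + stride - window)
        windows.append((context_start, start, min(start + stride, n_tokens)))
        start += stride
    return windows
```

The scored spans tile the sequence with no overlap, so summed losses divide cleanly into a bits-per-byte figure.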
Other
other
Causal token n-gram tilt using a prefix-only token expert; within-word and word-start experts disabled for legality.
parameters: {"base_beta":2,"agree_bonus":0.1,"within_beta":0,"word_beta":0}
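The tilt boosts next-token scores using counts of what followed the current prefix earlier in the same stream, which keeps it causal. With `within_beta` and `word_beta` at 0, only the prefix-only token expert is active. A sketch under assumptions: the log1p tilt formula and the reading of `agree_bonus` (a reward when the expert's top token matches the model's) are not specified in the PR:

```python
import math

def tilt_logits(logits, prefix, counts, base_beta=2.0, agree_bonus=0.1):
    """Causal prefix-only n-gram tilt (sketch). counts[prefix] holds how
    often each token followed `prefix` EARLIER in the same stream, so no
    lookahead occurs. The tilt formula and agree_bonus semantics are
    assumptions; within-word/word-start experts are disabled (betas 0)."""
    follow = counts.get(prefix)
    if not follow:
        return dict(logits)
    total = sum(follow.values())
    tilted = dict(logits)
    for tok, c in follow.items():
        if tok in tilted:
            tilted[tok] += base_beta * math.log1p(c / total)
    # assumed agree_bonus: extra credit when model and expert agree on top-1
    model_top = max(logits, key=logits.get)
    expert_top = max(follow, key=follow.get)
    if model_top == expert_top and model_top in tilted:
        tilted[model_top] += agree_bonus
    return tilted
```

Unseen prefixes leave the logits untouched, so the tilt degrades gracefully on novel text.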
Compression
lzma
level: null
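The serialized artifact (~15.99 MB) is packed with LZMA. With `level: null`, the preset is unspecified, so a minimal sketch using the standard-library default:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress serialized model bytes with LZMA. The preset is
    unspecified in the PR, so the stdlib default is assumed here."""
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    """Invert compress_artifact exactly (LZMA is lossless)."""
    return lzma.decompress(blob)
```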
Novel Contributions
- Muon momentum reduced to 0.97, improving validation BPB over the default 0.99 setting.
- Legal score-first test-time training where each chunk is scored before any gradient update.
- Causal token n-gram tilt using only the prefix-only token expert with within-word and word-start experts disabled.
- Combined SP8192 baseline with legal TTT and causal n-gram tilt to reach a new 3-seed mean record.