PR #1514 (open)
Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean)
by dexhunter
val_bpb: 1.0798
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB
Training Techniques
Optimizer
Muon
weight_decay: 0.085
momentum: 0.97
other_params: {"warmup_start":0.92,"warmup_end":0.97,"warmup_steps":1500}
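The momentum settings above imply a warmup from 0.92 to the final 0.97 over 1500 steps. A minimal sketch of that schedule, assuming linear interpolation (the PR lists only the endpoints and step count, not the interpolation shape):

```python
def muon_momentum(step, start=0.92, end=0.97, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`,
    then hold at `end`. Linear interpolation is an assumption; the PR
    only specifies warmup_start, warmup_end, and warmup_steps."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```

The returned value would feed the momentum coefficient of the Muon update at each step.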
Architecture
weight tying
Tied token embeddings.
parameters: null
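Weight tying shares one matrix between the input embedding table and the output (unembedding) projection, halving those parameters. A toy sketch with plain Python lists (the real model's shapes and storage are not specified here):

```python
class TiedLM:
    """Toy skeleton showing tied token embeddings: the embedding table
    and the output projection are literally the same object."""

    def __init__(self, vocab_size, d_model):
        # one shared table: row i is the embedding of token i
        self.embed = [[0.01 * (i + j) for j in range(d_model)]
                      for i in range(vocab_size)]

    def embed_token(self, tok):
        return self.embed[tok]

    def logits(self, hidden):
        # output projection reuses the same rows (transposed use)
        return [sum(h * w for h, w in zip(hidden, row))
                for row in self.embed]
```

Because both directions read from `self.embed`, any update to the table moves the input and output representations together.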
depth recurrence
Loops layers 3-5 twice during training.
parameters: {"layers":[3,5],"repeats":2}
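Depth recurrence reuses a contiguous block of layers without adding parameters. A sketch of the forward pass with the PR's settings, assuming the `[3,5]` range is inclusive (layer callables here are stand-ins for transformer blocks):

```python
def forward_with_recurrence(x, layers, loop_range=(3, 5), repeats=2):
    """Run `layers` in order, but execute the inclusive slice
    layers[loop_range[0]..loop_range[1]] `repeats` times.
    Inclusive indexing is an assumption about the PR's notation."""
    lo, hi = loop_range
    for layer in layers[:lo]:
        x = layer(x)
    for _ in range(repeats):
        for layer in layers[lo:hi + 1]:
            x = layer(x)
    for layer in layers[hi + 1:]:
        x = layer(x)
    return x
```

With 7 layers, the execution order becomes 0, 1, 2, 3, 4, 5, 3, 4, 5, 6: extra depth at inference cost but no extra weights in the artifact.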
Partial RoPE
Partial rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
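Partial RoPE rotates only the first few dimensions of each head vector and passes the rest through unchanged. A sketch with the PR's `dimensions: 16` (the frequency base of 10000 is the common RoPE default, assumed here):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` dims of
    head vector `x`; remaining dims pass through unrotated. The base
    and pairing convention are assumptions (standard RoPE defaults)."""
    out = list(x)
    for k in range(rot_dims // 2):
        theta = pos * base ** (-2 * k / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * k], x[2 * k + 1]
        out[2 * k] = a * c - b * s
        out[2 * k + 1] = a * s + b * c
    return out
```

Dimensions beyond index 15 carry no positional signal, which leaves part of the head free for position-independent features.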
LeakyReLU^2
LeakyReLU^2 activation used in the MLP.
parameters: {"slope":0.5}
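The exact form of "LeakyReLU^2" is not spelled out in the PR; one plausible reading, sketched here as an assumption, is LeakyReLU followed by squaring (analogous to the squared-ReLU activation):

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with the PR's slope=0.5, then squared.
    Squaring the output is an assumed interpretation of 'LeakyReLU^2';
    note it discards the sign of the negative branch."""
    y = x if x > 0 else slope * x
    return y * y
```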
KV head count
Uses 4 KV heads.
parameters: {"kv_heads":4}
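With fewer KV heads than query heads (grouped-query attention), each KV head is shared by a group of query heads, shrinking the KV cache and the attention weights. A sketch of the head mapping; the query-head count of 8 is an illustrative assumption, the PR only fixes `kv_heads: 4`:

```python
def kv_head_for_query(q_head, n_q_heads=8, n_kv_heads=4):
    """Map each query head to its shared KV head under grouped-query
    attention. n_q_heads=8 is an assumed example value."""
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```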
Regularization
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: all weights
int8
bits: 8
scope: embeddings
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"freeze_blocks":0,"chunk_tokens":32768}
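The "legal" property of score-first TTT is an ordering constraint: each chunk's loss is recorded with the current weights before any gradient step on that chunk, so the reported score never benefits from its own tokens. A control-flow sketch, where `model_loss` and `update` are hypothetical callables standing in for the real model and optimizer step:

```python
def score_first_ttt(chunks, model_loss, update, epochs=3):
    """Legal score-first test-time training: score each chunk with the
    current weights BEFORE updating on it. `model_loss` and `update`
    are hypothetical stand-ins; the PR's real loop also uses
    lr=0.005 and 32768-token chunks."""
    scores = []
    for chunk in chunks:
        scores.append(model_loss(chunk))   # score first, pre-update
        for _ in range(epochs):
            update(chunk)                  # then adapt on that chunk
    return scores
```

Later chunks still benefit from adaptation on earlier chunks, which is where the TTT gain comes from.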
Evaluation
sliding window eval
parameters: null
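Sliding-window evaluation scores every token exactly once while giving each scored span a fixed amount of left context. A window-layout sketch; the window and stride sizes are illustrative assumptions, since the PR lists no parameters:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Return (context_start, score_start, score_end) triples so each
    token is scored once with up to `window` tokens of left context.
    window=1024 and stride=512 are assumed example values."""
    windows = []
    start = 0
    while start < n_tokens:
        context_start = max(0, start + stride - window)
        windows.append((context_start, start, min(start + stride, n_tokens)))
        start += stride
    return windows
```

The scored spans tile the sequence with no overlap, so summed losses divide cleanly into a bits-per-byte figure.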
Other
other
Causal token n-gram tilt using a prefix-only token expert; within-word and word-start experts disabled for legality.
parameters: {"base_beta":2,"agree_bonus":0.1,"within_beta":0,"word_beta":0}
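The tilt boosts next-token scores using counts of what followed the current prefix earlier in the same stream, which keeps it causal. With `within_beta` and `word_beta` at 0, only the prefix-only token expert is active. A sketch under assumptions: the log1p tilt formula and the reading of `agree_bonus` (a reward when the expert's top token matches the model's) are not specified in the PR:

```python
import math

def tilt_logits(logits, prefix, counts, base_beta=2.0, agree_bonus=0.1):
    """Causal prefix-only n-gram tilt (sketch). counts[prefix] holds how
    often each token followed `prefix` EARLIER in the same stream, so no
    lookahead occurs. The tilt formula and agree_bonus semantics are
    assumptions; within-word/word-start experts are disabled (betas 0)."""
    follow = counts.get(prefix)
    if not follow:
        return dict(logits)
    total = sum(follow.values())
    tilted = dict(logits)
    for tok, c in follow.items():
        if tok in tilted:
            tilted[tok] += base_beta * math.log1p(c / total)
    # assumed agree_bonus: extra credit when model and expert agree on top-1
    model_top = max(logits, key=logits.get)
    expert_top = max(follow, key=follow.get)
    if model_top == expert_top and model_top in tilted:
        tilted[model_top] += agree_bonus
    return tilted
```

Unseen prefixes leave the logits untouched, so the tilt degrades gracefully on novel text.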
Compression
lzma
level: null
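The serialized artifact (~15.99 MB) is packed with LZMA. With `level: null`, the preset is unspecified, so a minimal sketch using the standard-library default:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress serialized model bytes with LZMA. The preset is
    unspecified in the PR, so the stdlib default is assumed here."""
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    """Invert compress_artifact exactly (LZMA is lossless)."""
    return lzma.decompress(blob)
```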
Novel Contributions
- Muon momentum reduced to 0.97, improving validation BPB over the default 0.99 setting.
- Legal score-first test-time training where each chunk is scored before any gradient update.
- Causal token n-gram tilt using only the prefix-only token expert with within-word and word-start experts disabled.
- Combined SP8192 baseline with legal TTT and causal n-gram tilt to reach a new 3-seed mean record.