PR #2009

open

Record: DepthShare4096 + SparseAttnGate + Muon TTT - val_bpb 1.0500312

val_bpb: 1.0500
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,921,334 bytes

Training Techniques

Architecture
depth recurrence
8 base layers are reused for 3 recurrent passes, giving an effective 24-layer depth with weight tying.
parameters: {"layers":8,"recurrent_passes":3,"effective_depth":24}
weight tying
Input/output embeddings are tied and recurrent blocks share weights across passes.
parameters: null
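Embedding/output tying is the standard trick of pointing the LM head's weight at the token-embedding matrix; a minimal sketch (module names are illustrative):

```python
import torch.nn as nn

vocab_size, d_model = 4096, 512
wte = nn.Embedding(vocab_size, d_model)               # input token embeddings
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output projection
lm_head.weight = wte.weight                           # tie: one shared parameter tensor
```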
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"n_head":8,"n_kv_head":2}
Partial RoPE
Applies rotary embeddings to only part of the head dimensions.
parameters: {"rotary_pct":0.5}
SparseAttnGate
Learned per-head gating sparsifies attention weights below a threshold.
parameters: null
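No parameters are listed for SparseAttnGate, so the mechanics below are an assumption based only on the one-line description: a learned per-head threshold zeroes small post-softmax attention weights, and the surviving weights are renormalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAttnGate(nn.Module):
    """Hypothetical sketch: learned per-head threshold; weights below it become zero."""
    def __init__(self, n_head, init_threshold=0.01):
        super().__init__()
        # one learned threshold per head, kept positive via softplus (an assumption)
        self.raw_threshold = nn.Parameter(torch.full((n_head, 1, 1), -4.6))  # softplus(-4.6) ~ 0.01

    def forward(self, attn):                      # attn: (B, n_head, T, T), post-softmax
        thresh = F.softplus(self.raw_threshold)
        gated = torch.relu(attn - thresh)         # weights below the threshold become exactly 0
        return gated / gated.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize rows

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
sparse_attn = SparseAttnGate(n_head=8)(attn)
```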
Test-Time Training
TTT
Backward-only test-time adaptation of the layer-norm parameters during evaluation.
parameters: {"mode":"backward-only","adaptation_target":"layer norms"}
Compression
zlib
level: null
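The artifact size above is presumably measured on a compressed checkpoint; a sketch of one way to take that measurement with zlib (the model and compression level here are placeholders, since the PR leaves the level unspecified):

```python
import io
import zlib
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                           # stand-in for the real checkpointed model
buf = io.BytesIO()
torch.save(model.state_dict(), buf)               # serialize weights to a byte stream
compressed = zlib.compress(buf.getvalue())        # default level; the PR lists level: null
print(f"artifact size: {len(compressed):,} bytes")
```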
Optimizer
Muon
weight_decay: 0.01
momentum: 0.95
other_params: {"nesterov":true,"ns_steps":6,"lr":0.0095,"warmup_steps":200,"schedule":"cosine decay"}
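Muon orthogonalizes each 2-D weight's momentum update with a few Newton-Schulz iterations before applying it. A condensed sketch of one update using the listed hyperparameters (the quintic coefficients follow the commonly published Muon iteration and may not match the PR's exact code):

```python
import torch

def newton_schulz(G, steps=6, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix G via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(p, grad, buf, lr=0.0095, momentum=0.95, weight_decay=0.01, nesterov=True):
    """One Muon update for a 2-D parameter p; buf is its momentum buffer (updated in place)."""
    buf.mul_(momentum).add_(grad)
    update = grad.add(buf, alpha=momentum) if nesterov else buf
    update = newton_schulz(update, steps=6)
    p.mul_(1 - lr * weight_decay)                 # decoupled weight decay
    p.add_(update, alpha=-lr)
```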
LR Schedule
cosine decay
parameters: {"warmup_steps":200,"final_lr_multiplier":0.1}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Regularization
weight decay
parameters: {"value":0.01}

Novel Contributions

  • DepthShare-4096 depth-recurrent transformer with 8 layers reused for 3 passes
  • 4096-token BPE tokenizer to improve bits-per-byte (val_bpb) performance; see the bits-per-byte sketch after this list
  • SparseAttnGate attention sparsification
  • Partial RoPE with rotary_pct=0.5
  • Muon optimizer with Nesterov momentum 0.95 and 6 Newton-Schulz iterations per update
  • Backward-only test-time training on layer norms
  • 3-seed statistically significant improvement over prior SOTA
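For reference, a sketch of how bits-per-byte can be computed from token-level cross-entropy regardless of tokenizer vocabulary size (the numbers in the example call are arbitrary placeholders, not the PR's measurements):

```python
import math

def bits_per_byte(total_nll_nats, total_utf8_bytes):
    """Convert a summed next-token NLL (in nats) over a split into bits per byte."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# arbitrary placeholder values, not the PR's numbers
print(bits_per_byte(total_nll_nats=7.3e6, total_utf8_bytes=1.0e7))
```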