PR #953 (open)
Record: 1.0722 BPB — Improved TTT + HedgeMixer with Per-Layer LR Groups
by dexhunter
val_bpb
1.0722
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.66 MB
Training Techniques
Architecture
XSA
XSA applied across all 11 layers in the base architecture.
parameters: {"layers":11}
BigramHash
Bigram hash embedding with SmearGate in the context mixer.
parameters: {"size":6144,"dim":128}
SmearGate
Gating component paired with BigramHash.
parameters: null
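A minimal sketch of how a bigram-hash embedding with a smear gate could work. The table size (6144) and embedding dim (128) come from the parameters above; the hash function, the sigmoid gate, and all names below are illustrative assumptions, not the PR's actual code.

```python
import math
import random

# Table size and dim from the record; everything else is a hypothetical sketch.
TABLE_SIZE, DIM = 6144, 128

random.seed(0)
# Randomly initialized bigram embedding table (learned in a real model).
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def bigram_index(prev_tok: int, cur_tok: int) -> int:
    """Hash the (previous, current) token pair into the embedding table."""
    return (prev_tok * 1000003 + cur_tok) % TABLE_SIZE

def smear_gate(gate_logit: float) -> float:
    """Sigmoid gate deciding how much bigram signal to mix in."""
    return 1.0 / (1.0 + math.exp(-gate_logit))

def mix(token_emb, prev_tok, cur_tok, gate_logit):
    """Blend the ordinary token embedding with the hashed bigram embedding."""
    g = smear_gate(gate_logit)
    big = table[bigram_index(prev_tok, cur_tok)]
    return [(1.0 - g) * t + g * b for t, b in zip(token_emb, big)]
```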
Partial RoPE
Rotary positional encoding applied to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
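The "16/64" parameter means only the first 16 of each head's 64 dimensions are rotated; the rest pass through unrotated. A minimal per-vector sketch, using the standard RoPE frequency convention (the PR may choose the split or base differently):

```python
import math

HEAD_DIM, ROT_DIM = 64, 16  # "16/64": rotate only the first 16 dimensions

def partial_rope(q, pos, base=10000.0):
    """Apply rotary embedding to the first ROT_DIM dims of one head vector;
    leave dims ROT_DIM..HEAD_DIM untouched."""
    out = list(q)
    for i in range(0, ROT_DIM, 2):
        theta = pos / (base ** (i / ROT_DIM))
        c, s = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

At position 0 the rotation is the identity, and the unrotated tail is always passed through unchanged.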
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"squared":true,"alpha":0.5}
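One plausible reading of "LeakyReLU squared" with the listed alpha = 0.5 is LeakyReLU followed by squaring; note the square makes the negative branch positive as well. This is an assumption about the exact composition, not confirmed by the record:

```python
def leaky_relu_sq(x: float, alpha: float = 0.5) -> float:
    """LeakyReLU(x) then square: one interpretation of the listed
    {"squared": true, "alpha": 0.5} activation."""
    y = x if x >= 0.0 else alpha * x
    return y * y
```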
KV head count
Full multi-head attention with equal query and KV head counts.
parameters: {"heads":8,"kv_heads":8}
MLP3x
MLP expansion used in the base model.
parameters: {"expansion":3.5}
Quantization
GPTQ-lite
bits: 5
scope: all
Compression
zstd
level: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
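The EMA update is the standard one; the decay 0.997 comes from the parameters above. A minimal per-tensor sketch over flat weight lists:

```python
DECAY = 0.997  # from the record above

def ema_update(ema, w, decay=DECAY):
    """Standard EMA weight-averaging step: ema <- decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * x for e, x in zip(ema, w)]
```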
Regularization
LN scale
parameters: null
Test-Time Training
score-first TTT
parameters: {"epochs":4,"freeze_blocks":1,"learning_rate":0.0005,"chunk_tokens":32768}
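A hypothetical skeleton of the score-first TTT loop using the listed hyperparameters (4 epochs, 1 frozen block, lr 5e-4, 32768-token chunks) and the 11-layer depth from the architecture section. The event log stands in for real scoring and SGD steps; none of the names below are the PR's actual functions.

```python
# Hyperparameters from the record above.
EPOCHS, FREEZE_BLOCKS, LR, CHUNK = 4, 1, 5e-4, 32768

def ttt(tokens, n_blocks=11):
    """Score-first TTT sketch: each chunk is scored BEFORE the model adapts
    on it, so reported bits-per-byte never sees weights trained on that chunk."""
    trainable = list(range(FREEZE_BLOCKS, n_blocks))  # earliest block frozen
    chunks = [tokens[i:i + CHUNK] for i in range(0, len(tokens), CHUNK)]
    log = []
    for chunk in chunks:
        log.append(("score", len(chunk)))          # evaluate first
        for _ in range(EPOCHS):                    # then adapt for 4 epochs
            for b in trainable:
                log.append(("update", b))          # sgd_step(block=b, lr=LR)
    return log
```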
LR Schedule
cosine decay
parameters: {"within_ttt":true}
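The within-TTT cosine schedule can be sketched as the usual half-cosine from a peak rate to a floor. Here lr_max reuses the TTT learning rate from the record; decaying all the way to lr_min = 0 is an assumption:

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-4, lr_min=0.0):
    """Cosine decay over one TTT pass: starts at lr_max (adapt aggressively
    early), anneals to lr_min by the final step."""
    t = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```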
Evaluation
sliding window eval
parameters: {"skipped":true}
Novel Contributions
- Per-layer learning-rate groups for TTT, with higher LR on output projections and lower LR on input projections
- Cosine learning-rate schedule within TTT to adapt aggressively early and anneal later
- Increased TTT to 4 epochs while freezing only 1 block
- Skipped standalone sliding window evaluation to reclaim eval budget for the extra TTT epoch
- Improved HedgeMixer + legal TTT stack over PR #720
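The per-layer LR groups above could be built as optimizer parameter groups keyed on parameter names. Only the direction (higher LR on output projections, lower on input projections) comes from the PR; the 2.0x / 0.5x multipliers and the name-matching rules below are illustrative assumptions.

```python
BASE_LR = 5e-4  # TTT learning rate from the record

def lr_groups(param_names):
    """Split parameters into three LR groups for TTT: output projections get
    a boosted rate, input projections a reduced rate, the rest the base rate.
    Multipliers are hypothetical."""
    groups = {"out_proj": [], "in_proj": [], "other": []}
    for name in param_names:
        if "out_proj" in name or "down_proj" in name:
            groups["out_proj"].append(name)
        elif "in_proj" in name or "up_proj" in name:
            groups["in_proj"].append(name)
        else:
            groups["other"].append(name)
    return [
        {"params": groups["out_proj"], "lr": BASE_LR * 2.0},
        {"params": groups["in_proj"], "lr": BASE_LR * 0.5},
        {"params": groups["other"], "lr": BASE_LR},
    ]
```

In a torch-style optimizer the returned list would be passed directly as the parameter-group argument, with names replaced by the actual tensors.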