PR #967 (open)

Record: 1.0450 BPB — SGD TTT + HedgeMixer with Per-Layer LR Groups

by dexhunter
val_bpb: 1.0450
Architecture: Transformer
Optimizer: SGD
Artifact Size: 15.67 MB

Training Techniques

Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"optimizer":"SGD","momentum":0.9}
full TTT
parameters: {"epochs":4,"zero_frozen_blocks":true,"skip_sliding_eval":true}
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002}
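As a sketch of the TTT optimizer step reported above (SGD, lr 0.002, momentum 0.9), assuming a standard heavy-ball momentum rule; the parameter values, gradients, and loop structure are illustrative, not the author's code:

```python
def sgd_momentum_step(params, grads, velocity, lr=0.002, momentum=0.9):
    """One heavy-ball SGD step: v <- momentum*v + g; p <- p - lr*v."""
    for i, (p, g) in enumerate(zip(params, grads)):
        velocity[i] = momentum * velocity[i] + g
        params[i] = p - lr * velocity[i]
    return params, velocity

# Toy test-time-training loop minimizing sum(p^2) (illustrative only).
params = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(4):  # the entry reports 4 TTT epochs
    grads = [2 * p for p in params]  # gradient of sum(p^2)
    params, velocity = sgd_momentum_step(params, grads, velocity)
```

Relative to AdamW, this update has no per-parameter scaling, which is one plausible reason per-layer LR groups (below) matter more under SGD.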
Architecture
BigramHash
Bigram hash feature module used in the base architecture.
parameters: {"size":6144,"dim":128}
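A minimal sketch of a bigram-hash feature lookup with the sizes listed above (6144 buckets, dim 128); the hash mixing constant and table initialization are assumptions, since the entry does not specify them:

```python
import random

SIZE, DIM = 6144, 128  # table size and feature dim from the entry

# Illustrative random-normal init; the real module's init is unspecified.
random.seed(0)
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(SIZE)]

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    """Hash the (previous, current) token pair into one of SIZE buckets.
    The multiplier 1000003 is an arbitrary illustrative mixing prime."""
    return (prev_tok * 1000003 + cur_tok) % SIZE

def bigram_feature(prev_tok: int, cur_tok: int):
    """Return the dense feature vector for a token bigram."""
    return table[bigram_hash(prev_tok, cur_tok)]

vec = bigram_feature(17, 42)
```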
SmearGate
Gating component paired with BigramHash in the base architecture.
parameters: null
XSA
XSA applied across all layers in the inherited architecture.
parameters: {"layers":11}
Partial RoPE
Rotary positional embeddings applied to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
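The "16/64" split above can be sketched as rotating only the first 16 of 64 per-head dimensions and passing the rest through unchanged. The adjacent-pair rotation convention and frequency base are assumptions (RoPE implementations also commonly pair split halves):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first rot_dims entries of a
    per-head vector x at position pos; remaining dims are untouched."""
    out = list(x)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s      # rotate each adjacent pair
        out[2 * i + 1] = a * s + b * c  # by a position-dependent angle
    return out

head = [1.0] * 64
rotated = partial_rope(head, pos=3)
```

Since rotation is norm-preserving, each rotated pair keeps its length while encoding the position in its angle.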
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
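A one-function sketch of the activation described above, with the listed negative slope of 0.5. Whether the square preserves sign is not stated; this sketch squares the magnitude (the common ReLU² convention):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU followed by squaring: y = leaky_relu(x); return y*y.
    Sign handling of the square is an assumption, not confirmed by the entry."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```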
KV head count
Uses 8 KV heads with full multi-head attention.
parameters: {"kv_heads":8}
MLP3x
MLP hidden-layer expansion in the base architecture; note the listed expansion factor is 3.5 despite the inherited "MLP3x" name.
parameters: {"expansion":3.5}
weight tying
Whether input and output embeddings are tied is not explicitly stated in this entry.
parameters: null
Quantization
GPTQ-lite
bits: 5
scope: base model
Compression
zstd
level: null
Weight Averaging
EMA
parameters: {"decay":0.997}
Regularization
LN scale
parameters: null
LR Schedule
cosine decay
parameters: {"within_ttt":true}
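The cosine decay applied within TTT can be sketched as the usual half-cosine from the base LR down to a floor; the base LR of 0.002 comes from the optimizer section above, while the step count and zero floor are assumptions:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps TTT steps."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

lrs = [cosine_lr(s, 100) for s in range(101)]
```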
Other
other
Per-layer learning-rate groups for TTT, with higher LR for output projections and lower LR for input projections.
parameters: {"output_projections_lr_multiplier":3,"input_projections_lr_multiplier":0.5}
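The per-layer LR groups above (3x for output projections, 0.5x for input projections) might be wired up as below. The parameter names and substring matching rules are hypothetical; the entry only gives the multipliers:

```python
def build_lr_groups(param_names, base_lr=0.002, out_mult=3.0, in_mult=0.5):
    """Map each parameter name to its TTT learning rate:
    output projections get base_lr*3, input projections base_lr*0.5,
    everything else base_lr. Name matching is illustrative."""
    lrs = {}
    for name in param_names:
        if "out_proj" in name:
            lrs[name] = base_lr * out_mult
        elif "in_proj" in name:
            lrs[name] = base_lr * in_mult
        else:
            lrs[name] = base_lr
    return lrs

lrs = build_lr_groups(["blocks.0.attn.in_proj",
                       "blocks.0.attn.out_proj",
                       "embed"])
```

In a framework like PyTorch the same mapping would typically be expressed as optimizer parameter groups rather than a dict.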
other
HedgeMixer with backward-looking experts over scored tokens.
parameters: {"experts":["Neural","Unigram","Bigram","Trigram","Entropy"]}
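One plausible reading of "HedgeMixer with backward-looking experts" is a multiplicative-weights (Hedge) mixture over the five listed experts, reweighted by log-loss on tokens that have already been scored. The learning rate eta, the log-loss choice, and the update form are all assumptions; the entry only names the experts:

```python
import math

EXPERTS = ["Neural", "Unigram", "Bigram", "Trigram", "Entropy"]

def hedge_mix(expert_probs, weights):
    """Mix expert next-token probabilities with normalized weights."""
    total = sum(weights.values())
    return sum(weights[e] * expert_probs[e] for e in EXPERTS) / total

def hedge_update(weights, probs_on_scored, eta=1.0):
    """Hedge update: multiply each expert's weight by exp(-eta * log-loss)
    on a token that has already been scored (the backward-looking part)."""
    new = {}
    for e in EXPERTS:
        loss = -math.log(max(probs_on_scored[e], 1e-12))
        new[e] = weights[e] * math.exp(-eta * loss)
    return new

weights = {e: 1.0 for e in EXPERTS}
# One expert predicted the scored token well, the others poorly.
scored = {"Neural": 0.9, "Unigram": 0.1, "Bigram": 0.1,
          "Trigram": 0.1, "Entropy": 0.1}
weights = hedge_update(weights, scored)
mixed = hedge_mix({e: 0.5 for e in EXPERTS}, weights)
```

With eta = 1 and log-loss, the update reduces to multiplying each weight by the probability the expert assigned to the observed token, i.e. Bayesian mixture weights.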

Novel Contributions

  • Switched TTT from AdamW to SGD with momentum for a large BPB improvement
  • Added per-layer TTT learning-rate groups
  • Used cosine LR decay within TTT
  • Combined SGD TTT with HedgeMixer for the best reported score
  • Verified the method with a 3-seed evaluation and ablations