PR #995 (open)
Record: 1.0362 BPB — SGD Momentum 0.95 TTT + HedgeMixer + Per-Layer LR
by dexhunter
val_bpb
1.0362
Architecture
Transformer
Optimizer
SGD
Artifact Size
15.67MB
Training Techniques
Optimizer
SGD
weight_decay: null
momentum: 0.95
other_params: {"learning_rate":0.002}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.95,"epochs":4,"freeze_depth":0}
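The recorded TTT settings (lr 0.002, momentum 0.95, 4 epochs, freeze_depth 0, i.e. no frozen layers) can be sketched as a plain SGD-with-momentum adaptation loop. This is a minimal illustration, not the actual implementation; the gradient function and parameter layout are placeholders:

```python
import numpy as np

def sgd_momentum_step(params, grads, vel, lr=0.002, momentum=0.95):
    # Classic SGD with momentum: v <- m*v + g; p <- p - lr*v.
    for k in params:
        vel[k] = momentum * vel[k] + grads[k]
        params[k] = params[k] - lr * vel[k]
    return params, vel

def ttt_adapt(params, grad_fn, epochs=4):
    # Score-first TTT: the model is adapted on the evaluation text
    # itself before scoring; freeze_depth=0 means every layer updates.
    vel = {k: np.zeros_like(v) for k, v in params.items()}
    for _ in range(epochs):
        params, vel = sgd_momentum_step(params, grad_fn(params), vel)
    return params
```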
Architecture
BigramHash
Hash-based token embedding component in the base architecture
parameters: {"size":6144}
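One way a hash-based bigram embedding with 6144 buckets could work (the hash constant and pair encoding here are illustrative assumptions, not the PR's actual scheme):

```python
def bigram_hash_ids(tokens, size=6144):
    # Map each (previous, current) token pair to one of `size` buckets;
    # each bucket indexes a row of a learned embedding table.
    ids, prev = [], 0
    for t in tokens:
        ids.append(((prev * 1000003) ^ t) % size)
        prev = t
    return ids
```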
Partial RoPE
Partial rotary positional encoding
parameters: null
XSA
XSA-all attention variant used in the base model
parameters: {"layers":11}
LeakyReLU
LeakyReLU activation in the MLP
parameters: {"slope":0.5}
MLP3x
Expanded MLP width
parameters: {"multiplier":3.5}
KV head count
KV head count relative to attention heads (equal here, so no reduction)
parameters: {"heads":8,"kv_heads":8}
LogisticContextMixer
Backward-looking HedgeMixer with multiple experts
parameters: {"experts":5}
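The name suggests PAQ-style logistic context mixing over the five experts; a generic sketch under that assumption (initial weights, learning rate, and update rule are illustrative):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def mix(probs, weights):
    # Combine expert probabilities in the logit domain, then squash.
    z = sum(w * logit(p) for w, p in zip(weights, probs))
    return 1.0 / (1.0 + math.exp(-z))

def update(weights, probs, bit, lr=0.1):
    # Online logistic-regression step toward the observed bit (0 or 1).
    err = bit - mix(probs, weights)
    return [w + lr * err * logit(p) for w, p in zip(weights, probs)]
```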
Quantization
GPTQ-lite
bits: 5
scope: model
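The internals of "GPTQ-lite" are not specified here; as a rough illustration of 5-bit weight quantization, a symmetric round-to-nearest scheme is sketched below (full GPTQ additionally uses Hessian-aware error correction, which this omits):

```python
import numpy as np

def quant5(w):
    # Symmetric 5-bit quantization: signed levels -15..15 plus one scale.
    m = np.abs(w).max()
    scale = (m / 15.0) if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale
```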
Compression
zstd
level: null
LR Schedule
cosine decay
parameters: null
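Since the schedule's parameters are null, only the shape is known. A standard cosine decay from the base LR (0.002 per the optimizer settings) looks like this; the floor of 0.0 is an assumption:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    # Decay from base_lr at step 0 to min_lr at total_steps.
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```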
Sequence Length
sequence_length
train_length: 32000
eval_length: null
Regularization
weight decay
parameters: null
Novel Contributions
- Switched TTT optimization from AdamW to SGD with momentum 0.95
- Introduced per-layer learning-rate groups with higher LR for output projections and lower LR for input layers
- Validated the best configuration via multi-seed sweeps and ablations
- Combined score-first legal TTT with backward-looking HedgeMixer
- Achieved a new record mean validation BPB of 1.0362
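The per-layer learning-rate grouping could be built as below; the multipliers and name-matching rules are hypothetical, since the PR states only the direction (higher LR for output projections, lower for input layers):

```python
def per_layer_lrs(param_names, base_lr=0.002, out_mult=2.0, in_mult=0.5):
    # Assign each parameter a learning rate by layer role
    # (multipliers here are illustrative placeholders).
    groups = {}
    for name in param_names:
        if "out_proj" in name or "lm_head" in name:
            groups[name] = base_lr * out_mult   # output projections: boosted
        elif "embed" in name:
            groups[name] = base_lr * in_mult    # input layers: damped
        else:
            groups[name] = base_lr
    return groups
```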