PR #1087

open

Record: 1.1407 BPB — LeakyReLU^2 + Delayed QAT + Score-First TTT

by Dhenenjay

val_bpb

1.1407

Architecture

Transformer

Optimizer

Muon

Artifact Size

15.7 MB

Training Techniques

Quantization

STE QAT

bits: 5

scope: MLP

STE QAT

bits: 6

scope: attention

GPTQ-lite

bits: null

scope: all

late QAT

bits: 5

scope: MLP
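
The bit widths above (5 for MLP weights, 6 for attention weights) can be illustrated with a minimal fake-quantization sketch. It assumes symmetric, per-row uniform quantization with the scale taken from the row maximum; the PR summary does not state the exact scheme, and `fake_quantize_row` is a hypothetical helper name. In STE QAT the forward pass would use the rounded weights while the backward pass treats the rounding as the identity; only the forward part is shown here.

```python
def fake_quantize_row(row, bits):
    """Round one weight row onto a symmetric grid of 2**bits - 1 levels.

    Assumption: per-row max-abs scaling. The PR summary only gives bit widths.
    """
    qmax = 2 ** (bits - 1) - 1          # 15 for 5 bits, 31 for 6 bits
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)                # nothing to quantize in an all-zero row
    scale = max_abs / qmax
    # Quantize to integers in [-qmax, qmax], then map back to floats.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


mlp_row = [1.0, -0.5, 0.26, 0.0]
mlp_q = fake_quantize_row(mlp_row, bits=5)    # MLP scope
attn_q = fake_quantize_row(mlp_row, bits=6)   # attention scope
```

With round-to-nearest, each dequantized weight differs from the original by at most half a quantization step.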

Architecture

LeakyReLU

Uses LeakyReLU with negative slope 0.5, followed by squaring, as the MLP activation.

parameters: {"slope":0.5,"squared":true}
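
As a scalar sketch of the activation described by these parameters (slope 0.5, squared), assuming the square is applied to the LeakyReLU output so the result is always non-negative:

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0.0 else slope * x
    return y * y


positive = leaky_relu_squared(2.0)    # 2.0 passes through, then squares to 4.0
negative = leaky_relu_squared(-2.0)   # -2.0 is scaled to -1.0, then squares to 1.0
```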

Partial RoPE

Applies RoPE to only part of the head dimensions.

parameters: {"dimensions":16,"total_dimensions":64}
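
A minimal sketch of partial RoPE for a single head vector, rotating 16 of 64 dimensions as in the parameters above. Which dimensions are rotated (here the first 16), the half-split pairing, and the frequency base of 10000 are assumptions, not details confirmed by the PR summary.

```python
import math


def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate the first `rope_dims` entries of `vec`; leave the rest unchanged."""
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        # One rotation frequency per (i, i + half) pair inside the rotated slice.
        theta = position * base ** (-2.0 * i / rope_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=5)
```

The remaining 48 dimensions carry no positional rotation, and the rotation preserves the vector norm.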

BigramHash

Uses a bigram hash embedding component.

parameters: {"vocab_size":6144}
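
One way to picture a bigram hash embedding is as a lookup keyed by a hash of the (previous token, current token) pair, reduced to 6144 buckets. The hash below is a hypothetical multiplicative hash chosen only for illustration; the PR summary gives the table size but not the hash function or how the first token of a sequence is handled.

```python
def bigram_bucket(prev_token: int, token: int, num_buckets: int = 6144) -> int:
    """Map a token pair to a row index of the bigram embedding table.

    Assumption: a simple multiplicative hash; the real hash is not specified.
    """
    return (prev_token * 1000003 + token) % num_buckets


tokens = [17, 923, 4, 923]
# One bucket per adjacent pair; the first token has no predecessor here.
buckets = [bigram_bucket(p, t) for p, t in zip(tokens, tokens[1:])]
```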

SmearGate

Includes SmearGate in the architecture.

parameters: null

Regularization

LN scale

parameters: {"scale":"1/sqrt(l+1)"}
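
The scale formula can be evaluated per layer as below, assuming `l` is the zero-based layer index and that the factor multiplies the normalized activations; both points are assumptions, since the summary lists only the formula.

```python
import math


def ln_scale(layer_index: int) -> float:
    """Return 1/sqrt(l + 1) for zero-based layer index l."""
    return 1.0 / math.sqrt(layer_index + 1)


scales = [ln_scale(l) for l in range(4)]   # deeper layers get smaller scales
```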

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_every":50}
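
A sketch of the two averages under the listed parameters: an exponential moving average updated every step with decay 0.997, and a uniform running average over snapshots taken every 50 steps. How the PR combines the two, and when SWA collection starts, is not stated; `track_averages` is a hypothetical helper that simply maintains both side by side.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step: keep `decay` of the old average, blend in the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def swa_update(swa, weights, n_averaged):
    """Fold one more snapshot into a uniform running average of n_averaged snapshots."""
    return [(s * n_averaged + w) / (n_averaged + 1) for s, w in zip(swa, weights)]


def track_averages(weight_history, ema_decay=0.997, swa_every=50):
    """Maintain EMA (every step) and SWA (every `swa_every` steps) over a run."""
    ema = list(weight_history[0])
    swa, n_swa = None, 0
    for step, weights in enumerate(weight_history, start=1):
        ema = ema_update(ema, weights, ema_decay)
        if step % swa_every == 0:
            swa = list(weights) if swa is None else swa_update(swa, weights, n_swa)
            n_swa += 1
    return ema, swa


# Toy run: a single weight that takes the values 1.0, 2.0, ..., 100.0.
history = [[float(s)] for s in range(1, 101)]
ema, swa = track_averages(history)
```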

Optimizer

SGD

weight_decay: null

momentum: 0.95

other_params: {"learning_rate":0.005}

AdamW

weight_decay: null

momentum: null

other_params: {"learning_rate":0.035}

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":16384,"freeze_blocks":0}
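
The ordering implied by "score-first" can be sketched as a loop in which each chunk is scored with the current weights before the model adapts on it, so no token is scored by a model that has already trained on it. `score_fn` and `train_fn` are hypothetical callables standing in for the model's loss evaluation and one adaptation epoch; the actual training code is not part of this summary.

```python
def score_first_ttt(tokens, chunk_tokens, score_fn, train_fn, epochs=3):
    """Score each chunk, then adapt on it for `epochs` passes; return total loss."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        total_loss += score_fn(chunk)   # 1) score with the current weights
        for _ in range(epochs):         # 2) only afterwards train on the same chunk
            train_fn(chunk)
    return total_loss


# Toy usage that records the order of operations instead of running a model.
events = []
loss = score_first_ttt(
    list(range(5)),
    chunk_tokens=2,
    score_fn=lambda c: events.append(("score", tuple(c))) or 1.0,
    train_fn=lambda c: events.append(("train", tuple(c))),
    epochs=2,
)
```

In the PR's configuration the chunk size would be 16384 tokens with 3 epochs per chunk, and `freeze_blocks` of 0 suggests that no blocks are frozen during adaptation.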

Evaluation

sliding window eval

parameters: {"stride":64}
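
Sliding window evaluation with a stride can be sketched as window bookkeeping: each window advances by `stride` tokens, and only the tokens not yet scored are counted, so every token is scored once with as much left context as the window allows. The context length is not given in this summary, so `context_len` is a hypothetical argument.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Return (begin, end, score_from) triples.

    Tokens in [score_from, end) are scored using the context [begin, end).
    """
    if stride > context_len:
        raise ValueError("stride must not exceed context_len")
    windows = []
    scored_up_to = 0
    begin = 0
    while scored_up_to < n_tokens:
        end = min(begin + context_len, n_tokens)
        windows.append((begin, end, scored_up_to))
        scored_up_to = end
        begin += stride
    return windows


small = sliding_windows(n_tokens=10, context_len=4, stride=2)
```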

LR Schedule

warmdown

parameters: {"warmdown_steps":3500}
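
A warmdown over the final 3500 steps can be sketched as a learning-rate multiplier. The linear shape, the decay to zero, and the total step count used in the example are assumptions; the summary specifies only the warmdown length.

```python
def warmdown_lr_scale(step, total_steps, warmdown_steps=3500):
    """Multiplier on the base learning rate: 1.0 until the warmdown, then linear to 0."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return max(remaining, 0) / warmdown_steps


# Hypothetical 10000-step run: the decay begins at step 6500.
halfway = warmdown_lr_scale(8250, total_steps=10000)
```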

Novel Contributions

Delayed QAT with quantization noise injected only after step 5500
Score-first TTT with strong post-training gains
LeakyReLU(0.5)^2 combined with Partial RoPE and LN Scale
GPTQ-lite per-row optimal clip percentile search
EMA and SWA combined with delayed quantization-aware training
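
Two of the items above lend themselves to a short sketch: the step gate for delayed QAT, and a per-row search over clip percentiles. The sketch assumes that "after step 5500" excludes step 5500 itself, that the clip search minimizes squared reconstruction error under symmetric uniform quantization, and that the candidate percentile grid is as shown. None of these details are confirmed by the summary, and the function names are hypothetical.

```python
def qat_active(step, start_step=5500):
    """Delayed QAT gate: quantization noise is applied only after `start_step`."""
    return step > start_step


def quantize_with_clip(row, clip, qmax):
    """Symmetric uniform quantization; values beyond +/-clip saturate."""
    scale = clip / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def best_clip_for_row(row, bits, percentiles=(0.90, 0.95, 0.99, 1.0)):
    """Try several clip thresholds for one row; return (error, percentile, clip).

    Returns None for an all-zero row.
    """
    qmax = 2 ** (bits - 1) - 1
    magnitudes = sorted(abs(w) for w in row)
    best = None
    for p in percentiles:
        clip = magnitudes[int(p * (len(magnitudes) - 1))]
        if clip == 0.0:
            continue
        dequantized = quantize_with_clip(row, clip, qmax)
        error = sum((a - b) ** 2 for a, b in zip(row, dequantized))
        if best is None or error < best[0]:
            best = (error, p, clip)
    return best


# A row of small weights plus one large outlier.
row = [0.01 * i for i in range(-50, 51)] + [10.0]
result = best_clip_for_row(row, bits=5)
```

Because the full range (percentile 1.0) is among the candidates, the selected clip never has a higher error than plain max-abs scaling for the same row.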