PR #1809

open

Record: SP8192 + Gram-NS + Polar Express + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0800 (3-seed mean)

by PranavViswanath
val_bpb: 1.0800
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.02 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
GPTQ
bits: 6
scope: all model weights
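GPTQ proper quantizes weights column-by-column with Hessian-based error compensation; that machinery is beyond a short sketch. As a hedged illustration of what 6-bit weight quantization does to a matrix, here is a minimal round-to-nearest stand-in — not the GPTQ algorithm, and the per-row scaling is an assumption:

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 6) -> np.ndarray:
    """Round-to-nearest symmetric per-row quantization.

    Illustrative stand-in only: real GPTQ additionally propagates
    quantization error column-by-column using Hessian information.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 31 for 6-bit signed
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                              # dequantized weights
```

At 6 bits each weight is snapped to one of 64 levels, so the per-element error is bounded by half a quantization step.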
Architecture
depth recurrence
3-layer recurrence loops layers 3-5 twice, creating 17 virtual layers from 11 physical layers.
parameters: {"layers":[3,4,5],"loops":2,"virtual_layers":17,"physical_layers":11}
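The virtual-layer count follows from replaying the 3-layer span two extra times: 11 + 2×3 = 17. A minimal sketch of the unrolled execution order (0-indexing of the looped span is an assumption):

```python
def recurrent_schedule(physical: int = 11, loop=(3, 5), loops: int = 2) -> list:
    """Unrolled layer-index schedule for depth recurrence.

    Layers loop[0]..loop[1] are replayed `loops` extra times after their
    first pass, turning 11 physical layers into 17 virtual ones.
    """
    schedule = []
    for i in range(physical):
        schedule.append(i)
        if i == loop[1]:
            for _ in range(loops):                       # replay the span
                schedule.extend(range(loop[0], loop[1] + 1))
    return schedule
```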
Parallel Residuals
Attention and MLP read from the same pre-residual input in later layers.
parameters: {"start_layer":7}
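A sketch of the difference from a standard block, with stub callables standing in for the real attention, MLP, and norm:

```python
def parallel_block(x, attn, mlp, norm):
    """Parallel residual: attention and MLP read the same normalized
    input, and both outputs are summed into the residual stream."""
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, attn, mlp, norm):
    """Standard sequential block, shown for contrast: the MLP sees the
    attention output through the residual."""
    x = x + attn(norm(x))
    return x + mlp(norm(x))
```

The parallel form lets the two sublayers be computed concurrently, at the cost of the MLP no longer conditioning on that layer's attention output.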
weight tying
Tied embeddings are used.
parameters: null
Partial RoPE
Uses partial rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":"16/64"}
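With 16 of 64 head dimensions rotated, the remaining 48 pass through position-independent. A hedged sketch (the half-split pairing convention and base frequency are assumptions):

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of each head's
    dimensions; the rest pass through unrotated.

    x: (T, head_dim); positions: (T,). Pairs dimension i with i + rot_dims/2
    (a common convention, assumed here rather than taken from the submission).
    """
    d = rot_dims // 2
    inv_freq = base ** (-np.arange(d) / d)                 # (d,)
    angles = positions[:, None] * inv_freq[None, :]        # (T, d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :d], x[..., d:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```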
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
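A sketch of one plausible reading of "LeakyReLU squared" with slope 0.5 — the positive branch squared as in ReLU², with a sign-preserving negative branch; the exact negative-branch treatment is an assumption, not taken from the submission:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # positive branch: x**2 (as in ReLU-squared); negative branch:
    # sign-preserving slope * x * |x| — this form is an assumption.
    return np.where(x > 0, x * x, slope * x * np.abs(x))
```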
GQA
Grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
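With 8 query heads and 4 KV heads, each pair of query heads shares one KV head, halving the KV cache. A minimal sketch of the score computation:

```python
import numpy as np

def gqa_scores(q, k):
    """Grouped-query attention scores.

    q: (n_heads, T, hd); k: (n_kv_heads, T, hd). Each group of
    n_heads // n_kv_heads query heads attends against one shared KV head.
    """
    group = q.shape[0] // k.shape[0]
    k_rep = np.repeat(k, group, axis=0)            # (n_heads, T, hd)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
```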
MLP3x
Uses a 4x MLP expansion.
parameters: {"multiplier":4}
Weight Averaging
EMA
parameters: {"decay":0.9965}
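The EMA with decay 0.9965 maintains a shadow copy of the weights updated as `ema <- decay * ema + (1 - decay) * w` after each step; the shadow weights are what gets evaluated. A minimal sketch:

```python
class EMA:
    """Exponential moving average of weights (decay 0.9965 here
    corresponds to an averaging horizon of roughly 1/(1-0.9965) ~ 286 steps)."""

    def __init__(self, weights, decay=0.9965):
        self.decay = decay
        self.shadow = list(weights)

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1 - d) * w
                       for s, w in zip(self.shadow, weights)]
```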
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"gram_ns":true,"polar_express":true}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings/scalars"}
LR Schedule
warmdown
parameters: {"warmdown":0.72,"min_lr":0.1}
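One plausible reading of these parameters (an assumption, not confirmed by the record): hold the peak LR, then decay linearly to 10% of peak over the final 72% of steps.

```python
def lr_multiplier(step, total_steps, warmdown=0.72, min_lr=0.1):
    """Constant LR followed by a linear 'warmdown' to min_lr * peak over
    the final `warmdown` fraction of training."""
    start = (1.0 - warmdown) * total_steps
    if step < start:
        return 1.0
    frac = (step - start) / (total_steps - start)   # 0 -> 1 over the warmdown
    return 1.0 - (1.0 - min_lr) * frac
```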
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
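Logit softcapping with value 30 squashes logits smoothly into (-30, 30) while leaving small logits nearly unchanged:

```python
import math

def softcap(logit, cap=30.0):
    """Soft-cap a logit to (-cap, cap) via cap * tanh(logit / cap).
    For |logit| << cap this is approximately the identity."""
    return cap * math.tanh(logit / cap)
```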
weight decay
parameters: {"value":0.095}
weight decay
parameters: {"value":0.022}
Evaluation
sliding window eval
parameters: null
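Sliding-window evaluation re-scores overlapping windows but counts each token only once, so every scored token (past the first window) sees substantial left context. A sketch of the span bookkeeping — the window and stride values below are illustrative, not from the submission:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Yield (start, end, score_from) spans: each window is fed whole,
    but loss is accumulated only on tokens from `score_from` onward, so
    spans tile the sequence without double-counting."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else start + (window - stride)
        spans.append((start, end, min(score_from, end)))
        if end == n_tokens:
            break
        start += stride
    return spans
```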
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"chunk_size":32000}
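"Score-first" keeps the evaluation legal: each chunk is scored with the weights as they were *before* any gradient step on that chunk, and only then does the model train on it. A minimal sketch of the control flow, with hypothetical `score_fn`/`train_fn` hooks standing in for the real model and its SGD(lr=0.005, momentum=0.9) update:

```python
def ttt_score_first(chunks, score_fn, train_fn, epochs=3):
    """Score-first test-time training loop: evaluate each chunk before
    adapting on it, so no scored token's loss reflects training on
    that same chunk."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))   # evaluate first: no leakage
        for _ in range(epochs):
            train_fn(chunk)              # then adapt, e.g. SGD w/ momentum
    return losses
```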

Novel Contributions

  • Gram-NS for rectangular MLP matrices using Gram-matrix Newton-Schulz iterations
  • Polar Express per-iteration minimax Newton-Schulz coefficients
  • Only 4 Newton-Schulz steps per update, with the saved time reinvested as additional training budget
  • Reduced GPTQ reserve time to recover additional training time
  • 3-layer depth recurrence with 17 virtual layers
  • Parallel residuals in later layers
  • QK-Gain 5.25
  • Legal score-first test-time training
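As a sketch of the orthogonalization step the first three contributions modify: Muon replaces each matrix update with an approximate polar factor computed by Newton-Schulz iteration, and forming the Gram matrix on the shorter side keeps the iteration cheap for rectangular matrices. The version below uses the standard Muon quintic coefficients (3.4445, -4.7750, 2.0315); the Polar Express per-iteration minimax coefficients are not reproduced here.

```python
import numpy as np

def newton_schulz_orth(g, steps=4):
    """Approximate the polar factor UV^T of g via quintic Newton-Schulz.

    Standard Muon coefficients, which drive singular values toward ~1
    (oscillating in a band around it) rather than converging exactly.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    transposed = g.shape[0] > g.shape[1]
    x = g.T if transposed else g                 # keep the short side first
    x = x / (np.linalg.norm(x) + 1e-7)           # Frobenius norm bounds spectral norm
    for _ in range(steps):
        A = x @ x.T                              # Gram matrix on the small side
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x
```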