PR #2106

open

Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRA TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta

by PiyushDatta

val_bpb

1.0893

Architecture

Transformer

Optimizer

Muon

Artifact Size

15,999,684 bytes

Training Techniques

Architecture

depth recurrence

Layers 3-5 are run one extra time, so the 11 unique layers yield 14 effective passes (11 + 3).

parameters: {"layers":[3,4,5],"passes":14}
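The 14-pass count can be checked with a small sketch of the layer schedule. It assumes zero-based layer indices and that the repeated block runs immediately after its first pass; neither detail is stated in this summary, and `layer_schedule` is a hypothetical helper, not the PR's code.

```python
def layer_schedule(num_layers, loop_layers, extra_loops=1):
    """Return the order in which the unique layers are applied."""
    start, end = loop_layers[0], loop_layers[-1]
    schedule = list(range(end + 1))               # layers 0..end, first pass
    for _ in range(extra_loops):
        schedule += list(range(start, end + 1))   # repeat the looped block
    schedule += list(range(end + 1, num_layers))  # remaining layers
    return schedule


schedule = layer_schedule(num_layers=11, loop_layers=[3, 4, 5])
print(schedule)       # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(schedule))  # 14
```

The parameter count stays that of 11 layers; only the compute per forward pass grows.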

LeakyReLU

Uses LeakyReLU(0.5)^2 as the MLP activation.

parameters: {"slope":0.5}
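As a scalar illustration, the activation can be sketched as below. It assumes the square is applied elementwise to the LeakyReLU output; `leaky_relu_squared` is an illustrative name, not taken from the PR.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y


print(leaky_relu_squared(2.0))   # 4.0
print(leaky_relu_squared(-2.0))  # 1.0
```

Under this reading the output is never negative, and a negative input x contributes 0.25 * x^2 rather than being zeroed as with ReLU^2.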

GQA

Uses grouped-query attention: 8 query heads share 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
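With 8 query heads and 4 KV heads, each KV head serves two query heads. The sketch below assumes the common contiguous grouping (query heads 0-1 share KV head 0, and so on); the PR summary does not state the grouping, and the function name is hypothetical.

```python
def kv_head_for_query_head(q_head, num_heads=8, num_kv_heads=4):
    """Map a query head index to the KV head it reads from."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```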

Quantization

GPTQ

bits: 6

scope: weights and embeddings

mixed int6/int8

bits: null

scope: weights and embeddings
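For orientation, the storage format behind a 6-bit entry can be sketched with plain round-to-nearest quantization and one scale per row. This is not GPTQ: GPTQ additionally uses second-order (Hessian) information to compensate rounding error, which is omitted here. The symmetric range [-31, 31] and the per-row scale are assumptions, and the summary does not say which tensors use int8 instead.

```python
def quantize_row_int6(row):
    """Symmetric round-to-nearest int6 quantization with one scale per row."""
    qmax = 31  # assumed symmetric int6 range [-31, 31]
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from integers and the row scale."""
    return [v * scale for v in q]


row = [0.62, -0.31, 0.05, -0.9]
q, scale = quantize_row_int6(row)
restored = dequantize_row(q, scale)
max_error = max(abs(w - r) for w, r in zip(row, restored))
print(q)                       # [21, -11, 2, -31]
print(max_error <= scale / 2)  # True
```

Keeping the integers fixed and adjusting only `scale` afterwards is the degree of freedom that a per-row scale-tuning step would work with.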

Weight Averaging

SWA

parameters: {"start_scale":0.12,"frequency":"every step"}
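A minimal sketch of per-step weight averaging follows. It assumes `start_scale` means that averaging begins once the learning-rate multiplier has decayed to 0.12 or below, which this summary does not confirm. The trajectory values are made up for illustration, and the multi-trajectory, cross-rank variant is not modeled.

```python
class RunningAverage:
    """Incremental mean of weight snapshots (uniform SWA)."""

    def __init__(self):
        self.count = 0
        self.avg = None

    def update(self, weights):
        self.count += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            self.avg = [a + (w - a) / self.count
                        for a, w in zip(self.avg, weights)]


start_scale = 0.12
# Hypothetical (lr_multiplier, weights) pairs from the end of training.
trajectory = [
    (0.50, [1.0, 4.0]),
    (0.12, [2.0, 2.0]),
    (0.06, [4.0, 0.0]),
    (0.00, [6.0, 1.0]),
]

swa = RunningAverage()
for lr_scale, weights in trajectory:
    if lr_scale <= start_scale:  # assumed meaning of start_scale
        swa.update(weights)      # "every step" once averaging has started

print(swa.count)  # 3
print(swa.avg)    # [4.0, 1.0]
```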

Compression

brotli

level: null

Test-Time Training

LoRA TTT

parameters: {"phased":true,"score_first":true}
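The `score_first` flag suggests that each chunk is scored before the adapter trains on it, so no reported loss benefits from having seen its own tokens. The toy below shows only that ordering. A single trainable scalar stands in for the LoRA matrices, the base parameter stays frozen, and the meaning of `phased` is not described in this summary, so it is not modeled. All names are hypothetical.

```python
class ScalarAdapter:
    """Toy stand-in for a LoRA adapter: frozen base plus one trainable offset."""

    def __init__(self, base, lr):
        self.base = base   # frozen "pretrained" parameter
        self.delta = 0.0   # trainable adapter parameter
        self.lr = lr

    def loss(self, target):
        return (self.base + self.delta - target) ** 2

    def adapt(self, target):
        grad = 2.0 * (self.base + self.delta - target)  # d(loss)/d(delta)
        self.delta -= self.lr * grad


def score_first_ttt(model, chunks):
    losses = []
    for target in chunks:
        losses.append(model.loss(target))  # score before adapting on this chunk
        model.adapt(target)                # then train on the chunk just scored
    return losses


model = ScalarAdapter(base=0.0, lr=0.25)
losses = score_first_ttt(model, [1.0, 1.0, 1.0])
print(losses)      # [1.0, 0.25, 0.0625]
print(model.base)  # 0.0
```

The first loss is computed with the untouched adapter, and later chunks benefit only from adaptation on earlier ones.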

Optimizer

Muon

weight_decay: 0.095

momentum: 0.95

other_params: {"matrix_lr":0.028,"embed_wd":0.085,"embed_optimizer":"AdamW"}

LR Schedule

warmdown

parameters: {"warmdown_fraction":0.72}
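A warmdown fraction of 0.72 means the learning rate decays over the final 72% of training. The sketch below assumes a constant rate before that point and a linear decay to zero; the decay shape and any warmup are not stated in this summary.

```python
def lr_multiplier(step, total_steps, warmdown_fraction=0.72):
    """Learning-rate multiplier: flat at 1.0, then linear decay to 0."""
    remaining = 1.0 - step / total_steps          # fraction of training left
    return min(1.0, remaining / warmdown_fraction)


total = 1000
for step in (0, 280, 640, 1000):
    print(step, round(lr_multiplier(step, total), 3))
```

With 1000 steps, the multiplier stays at 1.0 through step 280, reaches roughly 0.5 at step 640, and hits 0 at the final step.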

Sequence Length

sequence_length

train_length: 393216

eval_length: null

Other

other

Uses the SP8192 tokenizer, whose vocabulary is 8x the size of the SP1024 baseline (8192 vs. 1024 tokens).

parameters: {"vocab_size":8192}

other

Uses Polar Express Newton-Schulz coefficients for the Muon optimizer.

parameters: null
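Muon approximately orthogonalizes each update matrix with a Newton-Schulz iteration. The sketch below uses the textbook cubic iteration with fixed coefficients (1.5, -0.5) to show the structure only. The Polar Express coefficients are not listed in this summary and are not reproduced here, so this is a swapped-in stand-in rather than the PR's variant.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Textbook cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X."""
    norm = math.sqrt(sum(v * v for row in g for v in row))
    # Frobenius normalization keeps all singular values at or below 1,
    # inside the convergence region of the iteration.
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(xr, cr)]
             for xr, cr in zip(x, xxtx)]
    return x


g = [[2.0, 1.0], [0.0, 1.0]]  # arbitrary full-rank example matrix
q = newton_schulz_orthogonalize(g)
qqt = matmul(q, transpose(q))  # close to the identity if q is orthogonal
print([[round(v, 6) for v in row] for row in qqt])
```

Practical Muon implementations run only a handful of iterations with tuned coefficients, trading exactness for speed; the 30 steps here are chosen so the toy converges to machine precision.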

Novel Contributions

Multi-trajectory SWA with independent per-rank warmdown trajectories and cross-rank averaging
Scale tuning post-GPTQ by freezing int weights and fine-tuning only per-row scales
Two-pass GPTQ with Hessian recollection on the quantized model
Selective training-time 2:4 sparsity pruning on MLP weights
SP8192 tokenizer with GPTQ embeddings and SDClip-style quantization
Depth recurrence in layers 3-5
Polar Express Newton-Schulz optimizer coefficients
Phased LoRA test-time training
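The 2:4 sparsity pattern listed above keeps at most two nonzero values in every group of four consecutive weights. The sketch below uses magnitude-based selection, which is an assumption; the summary does not say how the PR chooses which weights to drop or what makes the pruning "selective".

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned += [v if j in keep else 0.0 for j, v in enumerate(group)]
    return pruned


row = [0.9, -0.1, 0.3, -0.7, 0.2, 0.1, -0.5, 0.05]
pruned_row = prune_2_of_4(row)
print(pruned_row)  # [0.9, 0.0, 0.0, -0.7, 0.2, 0.0, -0.5, 0.0]
```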