PR #450

open

Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)

by zachgoldfine44
val_bpb
1.1466
Architecture
Transformer
Optimizer
Muon
Artifact Size
14,385,363 bytes

Training Techniques

Architecture
Catalytic Residual Connections
Replace x + f(x) with x + c * f(x), where c is a learned per-dimension vector initialized to ones.
parameters: null
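A minimal NumPy sketch of the catalytic residual: the sub-block f below is a stand-in for an attention or MLP block, and the width of 4 is a toy size. Because c is initialized to ones, the block is exactly a standard residual at the start of training:

```python
import numpy as np

def catalytic_residual(x, f, c):
    # Standard residual is x + f(x); the catalytic variant scales the branch
    # output by a learned per-dimension vector c before adding it back.
    return x + c * f(x)

dim = 4                                # arbitrary toy width
c = np.ones(dim)                       # learned in training; ones at init
x = np.arange(dim, dtype=float)
f = lambda v: 2.0 * v                  # stand-in for an attention/MLP block
out = catalytic_residual(x, f, c)      # equals x + f(x) at initialization
```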
depth
Use a 12-layer Transformer stack.
parameters: {"layers":12}
BigramHash
Hash consecutive token pairs into a larger bigram embedding table and project to model dimension.
parameters: {"vocab_size":10240,"dim":128}
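A hedged sketch of the BigramHash idea: hash each (previous, current) token pair into one of 10240 buckets, look up a 128-dim embedding, and project to the model dimension. The hash multiplier, the pad id at position 0, and the model width of 512 are illustrative assumptions, not values stated in the PR:

```python
import numpy as np

N_BUCKETS, BIGRAM_DIM, MODEL_DIM = 10240, 128, 512  # MODEL_DIM is assumed

def bigram_ids(tokens, n_buckets=N_BUCKETS):
    # Pair each token with its predecessor (pad id 0 at position 0) and
    # hash the pair into a bucket; the odd multiplier is an arbitrary choice.
    prev = np.concatenate(([0], tokens[:-1]))
    return (prev * 1000003 + tokens) % n_buckets

rng = np.random.default_rng(0)
table = rng.normal(size=(N_BUCKETS, BIGRAM_DIM))  # bigram embedding table
proj = rng.normal(size=(BIGRAM_DIM, MODEL_DIM))   # projection to model dim

tokens = np.array([5, 17, 17, 42])
bigram_emb = table[bigram_ids(tokens)] @ proj     # (seq_len, MODEL_DIM)
```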
XSA
Cross-sequence attention applied in the last 4 layers.
parameters: {"layers":4}
KV head count
Grouped-query attention with 4 KV heads and 8 attention heads.
parameters: {"heads":8,"kv_heads":4}
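A small NumPy sketch of grouped-query attention with the record's 8 query heads and 4 KV heads: each KV head is shared by 2 query heads, realized by repeating K/V before the usual attention math. Head dim and sequence length are arbitrary, and causal masking is omitted for brevity:

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (H, T, d); k, v: (H_kv, T, d) with H divisible by H_kv.
    H, T, d = q.shape
    h_kv = k.shape[0]
    k = np.repeat(k, H // h_kv, axis=0)  # each KV head serves H//H_kv query heads
    v = np.repeat(v, H // h_kv, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over keys
    return w @ v                            # (H, T, d)

rng = np.random.default_rng(0)
H, H_KV, T, D = 8, 4, 16, 32               # heads/kv_heads from the record
out = gqa_attention(rng.normal(size=(H, T, D)),
                    rng.normal(size=(H_KV, T, D)),
                    rng.normal(size=(H_KV, T, D)))
```

Sharing KV heads halves KV-cache size here while leaving the query-side capacity at 8 heads.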
MLP3x
MLP with 3x expansion and relu^2 activation.
parameters: {"hidden":1536}
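The MLP block as a sketch: 3x expansion with relu² activation. A hidden size of 1536 implies a model width of 512, which is an inference from the parameters rather than a stated value:

```python
import numpy as np

def relu2(x):
    # relu^2: square of the ReLU output
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    # x: (T, D); w_in: (D, 3D); w_out: (3D, D)
    return relu2(x @ w_in) @ w_out

D = 512                                  # model width inferred from hidden=1536
rng = np.random.default_rng(0)
h = mlp3x(rng.normal(size=(4, D)),
          rng.normal(size=(D, 3 * D)) * 0.02,
          rng.normal(size=(3 * D, D)) * 0.02)
```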
Quantization
STE QAT
bits: 6
scope: all
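A sketch of the forward half of STE int6 QAT: symmetric fake quantization to 6 bits (integer levels in [-32, 31]). In training, the backward pass uses the straight-through estimator, i.e. the gradient of round() is treated as identity so gradients flow to the full-precision weights; only the forward is shown, and per-tensor scaling is an assumption:

```python
import numpy as np

def fake_quant_int6(w, eps=1e-12):
    # Symmetric per-tensor fake quantization to int6: map the largest |w|
    # to level 31, round to the integer grid, clip to [-32, 31], dequantize.
    scale = np.abs(w).max() / 31.0 + eps
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale                     # same shape/dtype as w ("fake" quant)

w = np.array([-1.0, -0.3, 0.0, 0.5, 1.0])
wq = fake_quant_int6(w)                  # used in the forward pass; STE backward
```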
Weight Averaging
SWA
parameters: {"start_fraction":0.8,"every_steps":50}
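The SWA schedule implied by the parameters, as a runnable sketch: snapshot the weights every 50 steps once 80% of training is done and keep a running mean. A 1-D array stands in for the model, and 1000 total steps is hypothetical:

```python
import numpy as np

TOTAL_STEPS = 1000                       # hypothetical run length
START = int(0.8 * TOTAL_STEPS)           # start_fraction = 0.8
EVERY = 50                               # every_steps = 50

swa_avg, n_snapshots = None, 0
for step in range(TOTAL_STEPS):
    weights = np.array([float(step)])    # stand-in for current model weights
    if step >= START and (step - START) % EVERY == 0:
        n_snapshots += 1
        if swa_avg is None:
            swa_avg = weights.copy()
        else:
            swa_avg += (weights - swa_avg) / n_snapshots  # incremental mean
```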
Optimizer
Muon
weight_decay: 0.042
momentum: 0.95
other_params: {"matrix_lr":0.04}
AdamW
weight_decay: 0.042
momentum: null
other_params: {"scope":"embeddings/scalars"}
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
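One common way to realize sliding-window eval with stride 64, sketched below: windows advance by the stride and each token's loss is computed exactly once, in the window that gives it the most left context. A window length of 2048 (matching train_length) is an assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    # Returns (start, end, n_scored) spans: the model sees tokens [start:end]
    # and only the final n_scored positions contribute loss, so every token
    # is scored exactly once with as much left context as the window allows.
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_windows(100, window=80, stride=20)  # toy sizes for illustration
```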
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
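A sketch of orthogonal initialization via QR, with a muP-style gain on output projections. Scaling the output-projection gain by 1/sqrt(width) is one common muP-flavored choice, assumed here rather than taken from the PR:

```python
import numpy as np

def ortho_init(rows, cols, gain=1.0, seed=0):
    # Draw a Gaussian matrix, orthogonalize with QR, fix signs so the result
    # is deterministic, and apply the requested gain.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(rows, cols))
    tall = rows >= cols
    q, r = np.linalg.qr(a if tall else a.T)
    q = q * np.sign(np.diag(r))
    return gain * (q if tall else q.T)

D = 512                                          # assumed model width
w_out = ortho_init(D, D, gain=1.0 / np.sqrt(D))  # muP-style output projection
w = ortho_init(8, 4)                             # columns are orthonormal
```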
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
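The layerwise LN scale, sketched: the LayerNorm gain of layer i (0-based) is set to 1/sqrt(i+1), so deeper layers contribute progressively less at initialization:

```python
import math

def ln_scale(layer_idx):
    # scale = 1 / sqrt(layer_idx + 1), per the record's regularization entry
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(12)]  # one scale per layer of the 12L stack
```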
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":4000,"warmup_steps":20}
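A sketch of the schedule implied by the parameters: 20 linear warmup steps, a constant plateau, then a linear "warmdown" to zero over the final 4000 iterations. The total step count of 10000 is hypothetical:

```python
def lr_mult(step, total_steps, warmup=20, warmdown=4000):
    # Multiplier applied to the base LR at `step` (0-based).
    if step < warmup:
        return (step + 1) / warmup                        # linear warmup
    if step >= total_steps - warmdown:
        return max(0.0, (total_steps - step) / warmdown)  # linear warmdown
    return 1.0                                            # constant plateau

TOTAL = 10000                                             # hypothetical
mults = [lr_mult(s, TOTAL) for s in (0, 19, 5000, 8000, TOTAL - 1)]
```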
Other
other
Late QAT: STE int6 quantization enabled only in the final portion of training, gated by a threshold of 0.25.
parameters: {"threshold":0.25}
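One plausible reading of the 0.25 threshold, sketched as a gate: STE int6 fake quantization turns on once the remaining fraction of training falls to 0.25 or below. This interpretation is an assumption; the PR only states the threshold value:

```python
def qat_active(step, total_steps, threshold=0.25):
    # Enable QAT late: remaining fraction of training <= threshold (assumed).
    return (total_steps - step) / total_steps <= threshold

flags = [qat_active(s, 1000) for s in (0, 500, 750, 900)]
```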

Novel Contributions

  • Catalytic residual connections with learned per-dimension residual scaling
  • 12-layer depth scaling as a sweet spot under the budget
  • BigramHash with 10240 buckets
  • Late QAT using STE int6 quantization
  • Stochastic Weight Averaging from the last 20% of warmdown