PR #1992

closed

Record: SP8192 Full Stack + Headwise Gated Attention + PreQuantTTT (1.0511 BPB, 3-seed)

by jamesEmerson112
val_bpb: 1.0511
Architecture: Transformer
Optimizer: MuonEq-R
Artifact Size: ~15.74 MB

Training Techniques

Architecture
Gated Attention
Post-attention sigmoid gate applied per head after FA3+XSA; Q projection widened by extra gate dimensions to modulate each head's contribution before output projection.
parameters: {"gate_dim":null,"heads":8}
weight tying
The input embedding and output projection share weights.
parameters: null
depth recurrence
Layers 3-5 are looped to create virtual layers from the physical stack.
parameters: {"loops":2}
Parallel Residuals
GPT-J-style parallel residual connections used in the later layers.
parameters: {"start_layer":7}
U-Net skip connections
Sigmoid-gated skip connections bridging encoder and decoder paths.
parameters: null
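
A sketch of one gated skip, assuming a scalar learnable gate per encoder/decoder pair; the actual gating granularity in the run is not specified here:

    import torch
    import torch.nn as nn

    class GatedSkip(nn.Module):
        # Hedged sketch: a decoder activation mixes in the matching encoder
        # activation through a learned sigmoid gate (initialised to 0.5).
        def __init__(self):
            super().__init__()
            self.gate = nn.Parameter(torch.zeros(1))

        def forward(self, decoder_x, encoder_x):
            return decoder_x + torch.sigmoid(self.gate) * encoder_x
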
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"base_dimensions":64}
XSA
Exclusive Self-Attention used on all layers.
parameters: {"layers":11}
Weight Averaging
EMA
parameters: {"decay":0.99}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(layer+1)"}
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9}
PreQuantTTT
parameters: {"epochs":21,"optimizer":"AdamW"}
Quantization
GPTQ
bits: null
scope: attention/MLP matrices and embeddings
mixed int6/int7
bits: 6
scope: attention/MLP matrices
mixed int6/int7
bits: 7
scope: token embeddings
Compression
Brotli
level: 11
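
For reference, the artifact-compression step is a single library call; file names here are placeholders:

    import brotli

    # Hedged sketch: compress the quantized checkpoint bytes at max quality.
    with open("quantized_checkpoint.bin", "rb") as f:       # placeholder path
        raw = f.read()
    with open("quantized_checkpoint.bin.br", "wb") as f:
        f.write(brotli.compress(raw, quality=11))
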
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
Sequence Length
sequence_length
train_length: null
eval_length: 32000

Novel Contributions

  • Headwise gated attention with per-head sigmoid gating after attention computation
  • Systematic 29-paper survey to identify techniques that transfer to the 36M-parameter regime
  • EMA decay scaling law showing stronger averaging is better at short training durations
  • Orthogonal stacking of small batch training, EMA tuning, and PreQuantTTT
  • Ablation study documenting multiple techniques that fail to transfer at 36M scale