PR #1992

closed

Record: SP8192 Full Stack + Headwise Gated Attention + PreQuantTTT (1.0511 BPB, 3-seed)

by jamesEmerson112
val_bpb: 1.0511
Architecture: Transformer
Optimizer: MuonEq-R
Artifact Size: ~15.74 MB

Training Techniques

Architecture
Gated Attention
Post-attention sigmoid gate applied per head after FA3+XSA; Q projection widened by extra gate dimensions to modulate each head's contribution before output projection.
parameters: {"gate_dim":null,"heads":8}
weight tying
The input embedding and output projection share weights.
parameters: null
depth recurrence
Layers 3-5 are looped to create virtual layers from the physical stack.
parameters: {"loops":2}
Parallel Residuals
GPT-J-style parallel residual connections used in the later layers.
parameters: {"start_layer":7}
U-Net skip connections
Sigmoid-gated skip connections bridging encoder and decoder paths.
parameters: null
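
A sketch of one gated skip, assuming a scalar learnable gate per encoder/decoder pair; the actual gating granularity in the run is not specified here:

    import torch
    import torch.nn as nn

    class GatedSkip(nn.Module):
        # Hedged sketch: a decoder activation mixes in the matching encoder
        # activation through a learned sigmoid gate (initialised to 0.5).
        def __init__(self):
            super().__init__()
            self.gate = nn.Parameter(torch.zeros(1))

        def forward(self, decoder_x, encoder_x):
            return decoder_x + torch.sigmoid(self.gate) * encoder_x
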
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"base_dimensions":64}
XSA
Exclusive Self-Attention used on all layers.
parameters: {"layers":11}
Weight Averaging
EMA
parameters: {"decay":0.99}
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"used_for":"embeddings and scalars"}
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(layer+1)"}
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"chunk_size":32000,"epochs_per_chunk":3,"learning_rate":0.005,"momentum":0.9}
PreQuantTTT
parameters: {"epochs":21,"optimizer":"AdamW"}
Quantization
GPTQ
bits: null
scope: attention/MLP matrices and embeddings
mixed int6/int7
bits: 6
scope: attention/MLP matrices
mixed int6/int7
bits: 7
scope: token embeddings
Compression
Brotli
level: 11
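
For reference, the artifact-compression step is a single library call; file names here are placeholders:

    import brotli

    # Hedged sketch: compress the quantized checkpoint bytes at max quality.
    with open("quantized_checkpoint.bin", "rb") as f:       # placeholder path
        raw = f.read()
    with open("quantized_checkpoint.bin.br", "wb") as f:
        f.write(brotli.compress(raw, quality=11))
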
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
Sequence Length
sequence_length
train_length: null
eval_length: 32000

Novel Contributions

  • Headwise gated attention with per-head sigmoid gating after attention computation
  • Systematic 29-paper survey to identify techniques that transfer to the 36M-parameter regime
  • EMA decay scaling law showing stronger averaging is better at short training durations
  • Orthogonal stacking of small batch training, EMA tuning, and PreQuantTTT
  • Ablation study documenting multiple techniques that fail to transfer at 36M scale