PR #1809

open

Record: SP8192 + Gram-NS + Polar Express + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0800 (3-seed mean)

by PranavViswanath
val_bpb: 1.0800
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~16.02 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
GPTQ
bits: 6
scope: all model weights
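GPTQ proper quantizes weights column-by-column with Hessian-based error compensation; that machinery is beyond a short sketch. As a hedged illustration of what 6-bit weight quantization does to a matrix, here is a minimal round-to-nearest stand-in — not the GPTQ algorithm, and the per-row scaling is an assumption:

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 6) -> np.ndarray:
    """Round-to-nearest symmetric per-row quantization.

    Illustrative stand-in only: real GPTQ additionally propagates
    quantization error column-by-column using Hessian information.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 31 for 6-bit signed
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                              # dequantized weights
```

At 6 bits each weight is snapped to one of 64 levels, so the per-element error is bounded by half a quantization step.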
Architecture
depth recurrence
3-layer recurrence loops layers 3-5 twice, creating 17 virtual layers from 11 physical layers.
parameters: {"layers":[3,4,5],"loops":2,"virtual_layers":17,"physical_layers":11}
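The virtual-layer count follows from replaying the 3-layer span two extra times: 11 + 2×3 = 17. A minimal sketch of the unrolled execution order (0-indexing of the looped span is an assumption):

```python
def recurrent_schedule(physical: int = 11, loop=(3, 5), loops: int = 2) -> list:
    """Unrolled layer-index schedule for depth recurrence.

    Layers loop[0]..loop[1] are replayed `loops` extra times after their
    first pass, turning 11 physical layers into 17 virtual ones.
    """
    schedule = []
    for i in range(physical):
        schedule.append(i)
        if i == loop[1]:
            for _ in range(loops):                       # replay the span
                schedule.extend(range(loop[0], loop[1] + 1))
    return schedule
```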
Parallel Residuals
Attention and MLP read from the same pre-residual input in later layers.
parameters: {"start_layer":7}
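A sketch of the difference from a standard block, with stub callables standing in for the real attention, MLP, and norm:

```python
def parallel_block(x, attn, mlp, norm):
    """Parallel residual: attention and MLP read the same normalized
    input, and both outputs are summed into the residual stream."""
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, attn, mlp, norm):
    """Standard sequential block, shown for contrast: the MLP sees the
    attention output through the residual."""
    x = x + attn(norm(x))
    return x + mlp(norm(x))
```

The parallel form lets the two sublayers be computed concurrently, at the cost of the MLP no longer conditioning on that layer's attention output.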
weight tying
Tied embeddings are used.
parameters: null
Partial RoPE
Uses partial rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":"16/64"}
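With 16 of 64 head dimensions rotated, the remaining 48 pass through position-independent. A hedged sketch (the half-split pairing convention and base frequency are assumptions):

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of each head's
    dimensions; the rest pass through unrotated.

    x: (T, head_dim); positions: (T,). Pairs dimension i with i + rot_dims/2
    (a common convention, assumed here rather than taken from the submission).
    """
    d = rot_dims // 2
    inv_freq = base ** (-np.arange(d) / d)                 # (d,)
    angles = positions[:, None] * inv_freq[None, :]        # (T, d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :d], x[..., d:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rot_dims:]], axis=-1)
```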
LeakyReLU
Uses LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
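A sketch of one plausible reading of "LeakyReLU squared" with slope 0.5 — the positive branch squared as in ReLU², with a sign-preserving negative branch; the exact negative-branch treatment is an assumption, not taken from the submission:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    # positive branch: x**2 (as in ReLU-squared); negative branch:
    # sign-preserving slope * x * |x| — this form is an assumption.
    return np.where(x > 0, x * x, slope * x * np.abs(x))
```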
GQA
Grouped-query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
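With 8 query heads and 4 KV heads, each pair of query heads shares one KV head, halving the KV cache. A minimal sketch of the score computation:

```python
import numpy as np

def gqa_scores(q, k):
    """Grouped-query attention scores.

    q: (n_heads, T, hd); k: (n_kv_heads, T, hd). Each group of
    n_heads // n_kv_heads query heads attends against one shared KV head.
    """
    group = q.shape[0] // k.shape[0]
    k_rep = np.repeat(k, group, axis=0)            # (n_heads, T, hd)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
```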
MLP3x
Uses a 4x MLP expansion.
parameters: {"multiplier":4}
Weight Averaging
EMA
parameters: {"decay":0.9965}
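The EMA with decay 0.9965 maintains a shadow copy of the weights updated as `ema <- decay * ema + (1 - decay) * w` after each step; the shadow weights are what gets evaluated. A minimal sketch:

```python
class EMA:
    """Exponential moving average of weights (decay 0.9965 here
    corresponds to an averaging horizon of roughly 1/(1-0.9965) ~ 286 steps)."""

    def __init__(self, weights, decay=0.9965):
        self.decay = decay
        self.shadow = list(weights)

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1 - d) * w
                       for s, w in zip(self.shadow, weights)]
```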
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"gram_ns":true,"polar_express":true}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings/scalars"}
LR Schedule
warmdown
parameters: {"warmdown":0.72,"min_lr":0.1}
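One plausible reading of these parameters (an assumption, not confirmed by the record): hold the peak LR, then decay linearly to 10% of peak over the final 72% of steps.

```python
def lr_multiplier(step, total_steps, warmdown=0.72, min_lr=0.1):
    """Constant LR followed by a linear 'warmdown' to min_lr * peak over
    the final `warmdown` fraction of training."""
    start = (1.0 - warmdown) * total_steps
    if step < start:
        return 1.0
    frac = (step - start) / (total_steps - start)   # 0 -> 1 over the warmdown
    return 1.0 - (1.0 - min_lr) * frac
```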
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
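Logit softcapping with value 30 squashes logits smoothly into (-30, 30) while leaving small logits nearly unchanged:

```python
import math

def softcap(logit, cap=30.0):
    """Soft-cap a logit to (-cap, cap) via cap * tanh(logit / cap).
    For |logit| << cap this is approximately the identity."""
    return cap * math.tanh(logit / cap)
```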
weight decay
parameters: {"value":0.095}
weight decay
parameters: {"value":0.022}
Evaluation
sliding window eval
parameters: null
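Sliding-window evaluation re-scores overlapping windows but counts each token only once, so every scored token (past the first window) sees substantial left context. A sketch of the span bookkeeping — the window and stride values below are illustrative, not from the submission:

```python
def sliding_windows(n_tokens, window=1024, stride=512):
    """Yield (start, end, score_from) spans: each window is fed whole,
    but loss is accumulated only on tokens from `score_from` onward, so
    spans tile the sequence without double-counting."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else start + (window - stride)
        spans.append((start, end, min(score_from, end)))
        if end == n_tokens:
            break
        start += stride
    return spans
```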
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"chunk_size":32000}
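"Score-first" keeps the evaluation legal: each chunk is scored with the weights as they were *before* any gradient step on that chunk, and only then does the model train on it. A minimal sketch of the control flow, with hypothetical `score_fn`/`train_fn` hooks standing in for the real model and its SGD(lr=0.005, momentum=0.9) update:

```python
def ttt_score_first(chunks, score_fn, train_fn, epochs=3):
    """Score-first test-time training loop: evaluate each chunk before
    adapting on it, so no scored token's loss reflects training on
    that same chunk."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))   # evaluate first: no leakage
        for _ in range(epochs):
            train_fn(chunk)              # then adapt, e.g. SGD w/ momentum
    return losses
```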

Novel Contributions

  • Gram-NS for rectangular MLP matrices using Gram-matrix Newton-Schulz iterations
  • Polar Express per-iteration minimax Newton-Schulz coefficients
  • Only 4 Newton-Schulz steps per update, with the saved time reinvested as additional training budget
  • Reduced GPTQ reserve time to recover additional training time
  • 3-layer depth recurrence with 17 virtual layers
  • Parallel residuals in later layers
  • QK-Gain 5.25
  • Legal score-first test-time training
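As a sketch of the orthogonalization step the first three contributions modify: Muon replaces each matrix update with an approximate polar factor computed by Newton-Schulz iteration, and forming the Gram matrix on the shorter side keeps the iteration cheap for rectangular matrices. The version below uses the standard Muon quintic coefficients (3.4445, -4.7750, 2.0315); the Polar Express per-iteration minimax coefficients are not reproduced here.

```python
import numpy as np

def newton_schulz_orth(g, steps=4):
    """Approximate the polar factor UV^T of g via quintic Newton-Schulz.

    Standard Muon coefficients, which drive singular values toward ~1
    (oscillating in a band around it) rather than converging exactly.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    transposed = g.shape[0] > g.shape[1]
    x = g.T if transposed else g                 # keep the short side first
    x = x / (np.linalg.norm(x) + 1e-7)           # Frobenius norm bounds spectral norm
    for _ in range(steps):
        A = x @ x.T                              # Gram matrix on the small side
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x
```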