PR #1770

Status: open

Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + V-Gated — val_bpb 1.0796 (3-seed mean)

by liujshi
val_bpb: 1.0796
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
3-layer recurrence with looping encoder/decoder layer patterns.
parameters: {"layers":3}
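The PR ships no code, so here is a minimal PyTorch sketch of the recurrence pattern: three shared blocks applied in a loop, giving extra effective depth without extra weights. The record fixes only {"layers":3}; the loop count, model width, and use of stock encoder layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Three shared transformer blocks applied repeatedly, so effective
    depth = 3 * loops while only 3 layers of weights are stored."""
    def __init__(self, d_model=256, n_head=4, loops=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
            for _ in range(3)  # {"layers": 3} from the record
        )
        self.loops = loops  # assumption: the PR does not state the loop count

    def forward(self, x):
        for _ in range(self.loops):
            for layer in self.layers:
                x = layer(x)
        return x

print(RecurrentStack()(torch.randn(2, 16, 256)).shape)  # (2, 16, 256)
```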
U-Net skip connections
Sigmoid-gated skip connections in a U-Net-style pairing of early and late layers.
parameters: null
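A hedged sketch of such a gated skip, assuming a learnable per-channel gate on the saved activation (the PR does not specify the gate's shape or initialization):

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """Sigmoid-gated U-Net-style skip: a learnable gate decides how much
    of a saved early-layer activation is mixed back in at a later layer."""
    def __init__(self, d_model=256):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(d_model))  # sigmoid(0) = 0.5

    def forward(self, x, skip):
        return x + torch.sigmoid(self.gate) * skip

# usage: U-Net style, layer i is typically paired with layer (n_layers - 1 - i)
x, saved = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
print(GatedSkip()(x, saved).shape)
```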
Parallel residuals
Attention and MLP operate on the same pre-residual input from layer 7 onward.
parameters: null
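This is the GPT-J/PaLM-style parallel block; a minimal sketch of one such block follows (the "from layer 7 onward" schedule is a per-layer choice not shown here, and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel residuals: attention and MLP both read the same normalized
    input, and their outputs are summed into a single residual update."""
    def __init__(self, d_model=256, n_head=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.LeakyReLU(0.5),   # matches the MLP entries
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm(x)                               # one shared pre-residual input
        a, _ = self.attn(h, h, h, need_weights=False)
        return x + a + self.mlp(h)

print(ParallelBlock()(torch.randn(2, 16, 256)).shape)
```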
SmearGate
Learnable gate that smooths representations to improve compressibility.
parameters: null
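The PR does not define SmearGate. One plausible reading, borrowed from the "smear" trick seen in nanogpt-style speedruns, blends each position with its predecessor through a learnable sigmoid gate; treat everything below as a hypothetical interpretation, not the PR's actual module:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Hypothetical SmearGate: mix each token's representation with the
    previous token's via a learnable sigmoid gate, smoothing the sequence
    (smoother activations/weights tend to compress better)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.gate = nn.Parameter(torch.full((d_model,), -2.0))  # starts near 0

    def forward(self, x):  # x: (batch, seq, d_model)
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)  # shift right by one step
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev

print(SmearGate()(torch.randn(2, 16, 256)).shape)
```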
Gated Attention
Per-head V-Gate in which the V projection gates both the attention input and each head's contribution to the output.
parameters: {"type":"per-head V-Gate"}
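One plausible wiring of that description: a sigmoid gate, computed from the same input as V, scales V before attention and scales each head's output after it. A hedged sketch (the PR's exact gating may differ; the gate projection here is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VGatedAttention(nn.Module):
    """Per-head V-gate: one sigmoid gate per head modulates both the values
    fed into attention and the head outputs before the final projection."""
    def __init__(self, d_model=256, n_head=4):
        super().__init__()
        self.n_head, self.d_head = n_head, d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.v_gate = nn.Linear(d_model, n_head)  # one scalar gate per head
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        q, k, v = (t.view(B, T, self.n_head, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        gate = torch.sigmoid(self.v_gate(x))        # (B, T, n_head)
        gate = gate.transpose(1, 2).unsqueeze(-1)   # (B, n_head, T, 1)
        v = v * gate                                # gate the attention input
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y * gate                                # gate each head's output
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

print(VGatedAttention()(torch.randn(2, 16, 256)).shape)
```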
Partial RoPE
Rotary position embeddings applied to only 16 of the 64 head dimensions.
parameters: {"dimensions":"16/64"}
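A minimal sketch of partial RoPE with the record's 16/64 split: rotate the first 16 dimensions of each head and pass the rest through unchanged (base frequency and the "first dims rotated" convention are assumptions):

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of each head dimension (16 of 64
    per the record); the remaining dimensions carry no positional rotation."""
    B, H, T, D = x.shape
    rot, rest = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)

q = torch.randn(2, 4, 16, 64)  # (batch, heads, seq, head_dim)
print(partial_rope(q).shape)   # (2, 4, 16, 64)
```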
weight tying
Input embedding and output head share one weight matrix.
parameters: null
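Weight tying is a one-liner; a sketch assuming "SP8192" means an 8192-token SentencePiece vocabulary (d_model is illustrative):

```python
import torch.nn as nn

vocab, d_model = 8192, 256  # vocab inferred from "SP8192"; width assumed
emb = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, vocab, bias=False)
head.weight = emb.weight  # one matrix serves both ends, shrinking the artifact
```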
MLP4x
MLP width multiplier of 4x with LeakyReLU activation.
parameters: {"multiplier":4}
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
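Combining the two entries above, the MLP is plausibly a plain two-layer block with 4x expansion and a 0.5-slope LeakyReLU (width is illustrative):

```python
import torch
import torch.nn as nn

d_model = 256
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # {"multiplier": 4}
    nn.LeakyReLU(negative_slope=0.5),  # {"slope": 0.5}
    nn.Linear(4 * d_model, d_model),
)
print(mlp(torch.randn(2, 16, d_model)).shape)
```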
Regularization
logit softcap
Logits bounded smoothly via tanh soft-capping: cap * tanh(logits / cap).
parameters: {"value":30}
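A sketch of tanh soft-capping with the record's cap of 30 (whether it is applied to final logits, attention logits, or both is not stated in the PR):

```python
import torch

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap) via tanh ({"value": 30}),
    avoiding the hard gradient cutoff of clipping."""
    return cap * torch.tanh(logits / cap)

print(softcap(torch.tensor([5.0, 50.0, 500.0])))  # saturates toward 30
```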
layerwise LN scale
parameters: null
Test-Time Training
Legal TTT
parameters: {"learning_rate":0.01,"epochs":3}
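A hedged sketch of the TTT loop with the record's lr=0.01 and epochs=3: briefly fine-tune on the evaluation text itself before scoring it. The optimizer choice and the stand-in model below are assumptions, and what makes this variant "legal" is contest-specific and not spelled out in the PR.

```python
import torch
import torch.nn.functional as F

class Bigram(torch.nn.Module):
    """Stand-in next-token model; the real model is the PR's transformer."""
    def __init__(self, vocab=256):
        super().__init__()
        self.table = torch.nn.Embedding(vocab, vocab)
    def forward(self, ids):
        return self.table(ids)  # (B, T) ids -> (B, T, vocab) logits

def test_time_train(model, ids, lr=0.01, epochs=3):
    """Fine-tune on the eval token stream for a few epochs at test time."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # optimizer assumed
    model.train()
    for _ in range(epochs):
        logits = model(ids[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               ids[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()

test_time_train(Bigram(), torch.randint(0, 256, (1, 128)))
```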
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"MUON_BACKEND_STEPS":4}
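MUON_BACKEND_STEPS=4 plausibly sets the Newton-Schulz iteration count at the core of Muon, which orthogonalizes the momentum-smoothed gradient of each weight matrix. A sketch of that iteration, using the quintic coefficients from Keller Jordan's widely used Muon implementation (the mapping of the flag to this step count is an assumption):

```python
import torch

def newton_schulz_orthogonalize(G, steps=4):
    """Approximately orthogonalize a 2D gradient via Newton-Schulz iteration;
    `steps` is presumed to correspond to MUON_BACKEND_STEPS=4."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)      # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                     # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

print(newton_schulz_orthogonalize(torch.randn(256, 1024)).shape)
```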
Compression
custom
level: null

Novel Contributions

  • Added a learnable final norm scale and SmearGate to smooth representations and reduce artifact size.
  • Added a per-head V-Gate to control both attention input and head output contribution.
  • Improved quantized compression with per-matrix automatic layout selection (see the sketch after this list).
  • Performed additional hyperparameter tuning including MUON_BACKEND_STEPS=4 and TTT_LR=0.01.
  • Achieved a 3-seed mean val_bpb of 1.0796 with approximately 15.99 MB artifact size.
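
The PR's codec is custom and not shown; a hypothetical sketch of per-matrix automatic layout selection, using zlib as a stand-in compressor: serialize each already-quantized matrix both row-major and column-major, and keep whichever byte stream compresses smaller.

```python
import zlib
import numpy as np

def best_layout(mat: np.ndarray, level=9):
    """Try row-major and column-major serializations of a quantized matrix
    and keep the smaller compressed blob (stand-in for the PR's codec)."""
    row = zlib.compress(np.ascontiguousarray(mat).tobytes(), level)
    col = zlib.compress(np.ascontiguousarray(mat.T).tobytes(), level)
    return ("row", row) if len(row) <= len(col) else ("col", col)

q = (np.random.randn(256, 1024) * 8).astype(np.int8)  # stand-in quantized weights
layout, blob = best_layout(q)
print(layout, len(blob))
```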