PR #1770

Status: open

Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + V-Gated — val_bpb 1.0796 (3-seed mean)

by liujshi
val_bpb: 1.0796
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
3-layer recurrence with looping encoder/decoder layer patterns.
parameters: {"layers":3}
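The PR ships no code, so here is a minimal PyTorch sketch of the recurrence pattern: three shared blocks applied in a loop, giving extra effective depth without extra weights. The record fixes only {"layers":3}; the loop count, model width, and use of stock encoder layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Three shared transformer blocks applied repeatedly, so effective
    depth = 3 * loops while only 3 layers of weights are stored."""
    def __init__(self, d_model=256, n_head=4, loops=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
            for _ in range(3)  # {"layers": 3} from the record
        )
        self.loops = loops  # assumption: the PR does not state the loop count

    def forward(self, x):
        for _ in range(self.loops):
            for layer in self.layers:
                x = layer(x)
        return x

print(RecurrentStack()(torch.randn(2, 16, 256)).shape)  # (2, 16, 256)
```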
U-Net skip connections
Sigmoid-gated skip connections in a U-Net-style pairing of early and late layers.
parameters: null
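A hedged sketch of such a gated skip, assuming a learnable per-channel gate on the saved activation (the PR does not specify the gate's shape or initialization):

```python
import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    """Sigmoid-gated U-Net-style skip: a learnable gate decides how much
    of a saved early-layer activation is mixed back in at a later layer."""
    def __init__(self, d_model=256):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(d_model))  # sigmoid(0) = 0.5

    def forward(self, x, skip):
        return x + torch.sigmoid(self.gate) * skip

# usage: U-Net style, layer i is typically paired with layer (n_layers - 1 - i)
x, saved = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
print(GatedSkip()(x, saved).shape)
```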
Parallel residuals
Attention and MLP operate on the same pre-residual input from layer 7 onward.
parameters: null
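This is the GPT-J/PaLM-style parallel block; a minimal sketch of one such block follows (the "from layer 7 onward" schedule is a per-layer choice not shown here, and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel residuals: attention and MLP both read the same normalized
    input, and their outputs are summed into a single residual update."""
    def __init__(self, d_model=256, n_head=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.LeakyReLU(0.5),   # matches the MLP entries
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm(x)                               # one shared pre-residual input
        a, _ = self.attn(h, h, h, need_weights=False)
        return x + a + self.mlp(h)

print(ParallelBlock()(torch.randn(2, 16, 256)).shape)
```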
SmearGate
Learnable gate that smooths representations to improve compressibility.
parameters: null
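The PR does not define SmearGate. One plausible reading, borrowed from the "smear" trick seen in nanogpt-style speedruns, blends each position with its predecessor through a learnable sigmoid gate; treat everything below as a hypothetical interpretation, not the PR's actual module:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Hypothetical SmearGate: mix each token's representation with the
    previous token's via a learnable sigmoid gate, smoothing the sequence
    (smoother activations/weights tend to compress better)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.gate = nn.Parameter(torch.full((d_model,), -2.0))  # starts near 0

    def forward(self, x):  # x: (batch, seq, d_model)
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)  # shift right by one step
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev

print(SmearGate()(torch.randn(2, 16, 256)).shape)
```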
Gated Attention
Per-head V-Gate in which the V projection gates both the attention input and each head's contribution to the output.
parameters: {"type":"per-head V-Gate"}
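One plausible wiring of that description: a sigmoid gate, computed from the same input as V, scales V before attention and scales each head's output after it. A hedged sketch (the PR's exact gating may differ; the gate projection here is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VGatedAttention(nn.Module):
    """Per-head V-gate: one sigmoid gate per head modulates both the values
    fed into attention and the head outputs before the final projection."""
    def __init__(self, d_model=256, n_head=4):
        super().__init__()
        self.n_head, self.d_head = n_head, d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.v_gate = nn.Linear(d_model, n_head)  # one scalar gate per head
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        q, k, v = (t.view(B, T, self.n_head, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        gate = torch.sigmoid(self.v_gate(x))        # (B, T, n_head)
        gate = gate.transpose(1, 2).unsqueeze(-1)   # (B, n_head, T, 1)
        v = v * gate                                # gate the attention input
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y * gate                                # gate each head's output
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

print(VGatedAttention()(torch.randn(2, 16, 256)).shape)
```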
Partial RoPE
Rotary position embeddings applied to only 16 of the 64 head dimensions.
parameters: {"dimensions":"16/64"}
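A minimal sketch of partial RoPE with the record's 16/64 split: rotate the first 16 dimensions of each head and pass the rest through unchanged (base frequency and the "first dims rotated" convention are assumptions):

```python
import torch

def partial_rope(x, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of each head dimension (16 of 64
    per the record); the remaining dimensions carry no positional rotation."""
    B, H, T, D = x.shape
    rot, rest = x[..., :rot_dims], x[..., rot_dims:]
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)

q = torch.randn(2, 4, 16, 64)  # (batch, heads, seq, head_dim)
print(partial_rope(q).shape)   # (2, 4, 16, 64)
```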
weight tying
Input embedding and output head share one weight matrix.
parameters: null
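Weight tying is a one-liner; a sketch assuming "SP8192" means an 8192-token SentencePiece vocabulary (d_model is illustrative):

```python
import torch.nn as nn

vocab, d_model = 8192, 256  # vocab inferred from "SP8192"; width assumed
emb = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, vocab, bias=False)
head.weight = emb.weight  # one matrix serves both ends, shrinking the artifact
```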
MLP4x
MLP width multiplier of 4x with LeakyReLU activation.
parameters: {"multiplier":4}
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
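Combining the two entries above, the MLP is plausibly a plain two-layer block with 4x expansion and a 0.5-slope LeakyReLU (width is illustrative):

```python
import torch
import torch.nn as nn

d_model = 256
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # {"multiplier": 4}
    nn.LeakyReLU(negative_slope=0.5),  # {"slope": 0.5}
    nn.Linear(4 * d_model, d_model),
)
print(mlp(torch.randn(2, 16, d_model)).shape)
```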
Regularization
logit softcap
Logits bounded smoothly via tanh soft-capping: cap * tanh(logits / cap).
parameters: {"value":30}
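A sketch of tanh soft-capping with the record's cap of 30 (whether it is applied to final logits, attention logits, or both is not stated in the PR):

```python
import torch

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap) via tanh ({"value": 30}),
    avoiding the hard gradient cutoff of clipping."""
    return cap * torch.tanh(logits / cap)

print(softcap(torch.tensor([5.0, 50.0, 500.0])))  # saturates toward 30
```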
layerwise LN scale
parameters: null
Test-Time Training
Legal TTT
parameters: {"learning_rate":0.01,"epochs":3}
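A hedged sketch of the TTT loop with the record's lr=0.01 and epochs=3: briefly fine-tune on the evaluation text itself before scoring it. The optimizer choice and the stand-in model below are assumptions, and what makes this variant "legal" is contest-specific and not spelled out in the PR.

```python
import torch
import torch.nn.functional as F

class Bigram(torch.nn.Module):
    """Stand-in next-token model; the real model is the PR's transformer."""
    def __init__(self, vocab=256):
        super().__init__()
        self.table = torch.nn.Embedding(vocab, vocab)
    def forward(self, ids):
        return self.table(ids)  # (B, T) ids -> (B, T, vocab) logits

def test_time_train(model, ids, lr=0.01, epochs=3):
    """Fine-tune on the eval token stream for a few epochs at test time."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # optimizer assumed
    model.train()
    for _ in range(epochs):
        logits = model(ids[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               ids[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()

test_time_train(Bigram(), torch.randint(0, 256, (1, 128)))
```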
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"MUON_BACKEND_STEPS":4}
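MUON_BACKEND_STEPS=4 plausibly sets the Newton-Schulz iteration count at the core of Muon, which orthogonalizes the momentum-smoothed gradient of each weight matrix. A sketch of that iteration, using the quintic coefficients from Keller Jordan's widely used Muon implementation (the mapping of the flag to this step count is an assumption):

```python
import torch

def newton_schulz_orthogonalize(G, steps=4):
    """Approximately orthogonalize a 2D gradient via Newton-Schulz iteration;
    `steps` is presumed to correspond to MUON_BACKEND_STEPS=4."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)      # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                     # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

print(newton_schulz_orthogonalize(torch.randn(256, 1024)).shape)
```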
Compression
custom
level: null

Novel Contributions

  • Added a learnable final norm scale and SmearGate to smooth representations and reduce artifact size.
  • Added a per-head V-Gate to control both attention input and head output contribution.
  • Improved quantized compression with per-matrix automatic layout selection (see the sketch after this list).
  • Performed additional hyperparameter tuning including MUON_BACKEND_STEPS=4 and TTT_LR=0.01.
  • Achieved a 3-seed mean val_bpb of 1.0796 with approximately 15.99 MB artifact size.
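
The PR's codec is custom and not shown; a hypothetical sketch of per-matrix automatic layout selection, using zlib as a stand-in compressor: serialize each already-quantized matrix both row-major and column-major, and keep whichever byte stream compresses smaller.

```python
import zlib
import numpy as np

def best_layout(mat: np.ndarray, level=9):
    """Try row-major and column-major serializations of a quantized matrix
    and keep the smaller compressed blob (stand-in for the PR's codec)."""
    row = zlib.compress(np.ascontiguousarray(mat).tobytes(), level)
    col = zlib.compress(np.ascontiguousarray(mat.T).tobytes(), level)
    return ("row", row) if len(row) <= len(col) else ("col", col)

q = (np.random.randn(256, 1024) * 8).astype(np.int8)  # stand-in quantized weights
layout, blob = best_layout(q)
print(layout, len(blob))
```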