PR #1715

open

Record: QK-Gain 5.5 — val_bpb 1.0809 (3-seed mean)

by G3sparky

val_bpb

1.0809

Architecture

Transformer

Optimizer

SGD

Artifact Size

~16.0 MB

Training Techniques

Quantization

GPTQ

bits: 6

scope: matrices

GPTQ

bits: 8

scope: embeddings

Architecture

depth recurrence

3-layer recurrence loop applied to layers 3-5, activated partway through training.

parameters: {"layers":[3,4,5],"num_loops":2,"activate_frac":0.35}
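A minimal sketch of how the listed recurrence parameters could translate into a layer execution order. It assumes 0-based layer indices, that `num_loops` counts the total number of passes through the looped block, and that `activate_frac` is a fraction of total training steps; the function name and the 8-layer depth in the usage line are hypothetical and not taken from the PR.

```python
def layer_order(num_layers, loop_layers, num_loops, step, total_steps, activate_frac):
    """Return the layer indices executed in one forward pass (illustrative only)."""
    # Assumption: recurrence switches on once this fraction of training has elapsed.
    recurrence_on = step >= activate_frac * total_steps
    first, last = loop_layers[0], loop_layers[-1]
    order = []
    for i in range(num_layers):
        if recurrence_on and i == first:
            # Assumption: the whole block runs num_loops times back to back.
            for _ in range(num_loops):
                order.extend(loop_layers)
        elif recurrence_on and first < i <= last:
            continue  # already emitted as part of the looped block
        else:
            order.append(i)
    return order


# Hypothetical 8-layer model, late in training:
# layer_order(8, [3, 4, 5], 2, step=900, total_steps=1000, activate_frac=0.35)
```

Under these assumptions the looped layers reuse the same weights, so the effective depth grows without adding parameters to the artifact.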

U-Net skip connections

Skip-gated U-Net style connections added to the network.

parameters: null

weight tying

Tied input and output embeddings.

parameters: null

Partial RoPE

Rotary position embeddings applied to a subset of dimensions.

parameters: {"dimensions":16,"total_dimensions":64}
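As an illustration of rotating 16 of 64 head dimensions, the sketch below applies a rotary embedding to the first `rot_dims` entries of a vector and passes the rest through unchanged. Which dimensions are rotated, the half-split pairing, and the base of 10000 are assumptions; the PR summary only gives the two dimension counts.

```python
import math


def partial_rope(x, position, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of x; leave the remaining entries as-is."""
    half = rot_dims // 2
    out = list(x)
    for i in range(half):
        # Assumed frequency schedule: base ** (-2i / rot_dims).
        theta = position * base ** (-2.0 * i / rot_dims)
        a, b = x[i], x[i + half]  # assumed pairing: entry i with entry i + half
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Each pair undergoes a pure rotation, so the norm of the rotated slice is preserved, while the other 48 dimensions carry no positional signal.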

LeakyReLU

LeakyReLU activation used in the MLP.

parameters: {"slope":0.5}

MLP4x

Expanded MLP width to 4x.

parameters: {"multiplier":4}

Regularization

logit softcap

parameters: {"value":30}
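A logit softcap is commonly written as `cap * tanh(logit / cap)`. The sketch below uses that form with the listed value of 30; whether this PR uses exactly the tanh form is an assumption.

```python
import math


def softcap(logits, cap=30.0):
    """Smoothly squash logits into (-cap, cap); near-identity for small values."""
    return [cap * math.tanh(z / cap) for z in logits]
```

Small logits pass through almost unchanged, while very large ones saturate near the cap instead of growing without bound.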

layerwise LN scale

parameters: null

Optimizer

SGD

weight_decay: 0.095

momentum: 0.9

other_params: {"learning_rate":0.005,"epochs_per_chunk":3}

Muon

weight_decay: null

momentum: null

other_params: {"variant":"MuonEq-R","row_normalized":true,"newton_schulz_steps":5}

Weight Averaging

EMA

parameters: {"decay":0.9965}
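For reference, a plain exponential moving average of weights with the listed decay can be sketched as below. The flat-list representation and the function name are illustrative only.

```python
def ema_update(ema, weights, decay=0.9965):
    """One EMA step: keep `decay` of the old average, blend in the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

With a decay of 0.9965 the average has an effective horizon of roughly 1 / (1 - 0.9965), or about 286 updates.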

Evaluation

sliding window eval

parameters: null
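One common way to set up sliding-window evaluation is sketched below: each window after the first advances by `stride` tokens and only the newly covered tokens are scored, so every token is scored exactly once with as much left context as the window allows. The window and stride values are hypothetical, since the summary lists no parameters.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end, score_from) spans.

    The model would see tokens[start:end] and only tokens[score_from:end]
    would count toward the reported loss.
    """
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    # The first window has no earlier context, so all of its tokens are scored.
    end = min(window, num_tokens)
    spans = [(0, end, 0)]
    while end < num_tokens:
        score_from = end
        end = min(end + stride, num_tokens)
        spans.append((max(0, end - window), end, score_from))
    return spans


# Hypothetical sizes: sliding_windows(10, window=4, stride=2)
```

A smaller stride gives each scored token more context at the cost of more forward passes.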

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.005,"epochs":3}
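The ordering implied by "score-first" can be sketched as follows: each chunk is scored with the current weights before any gradient step touches it, and only then is the model adapted on that chunk for the listed 3 epochs. `score_fn` and `train_fn` are hypothetical callables standing in for the real forward pass and SGD update (the learning rate of 0.005 would live inside `train_fn`).

```python
def score_first_ttt(chunks, score_fn, train_fn, epochs=3):
    """Return the mean per-token loss, scoring every chunk before training on it."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score with the current weights; this is the loss that gets reported.
        total_loss += score_fn(chunk)  # assumed to return the summed loss over the chunk
        total_tokens += len(chunk)
        # 2) Only afterwards adapt on the same, already scored chunk.
        for _ in range(epochs):
            train_fn(chunk)
    return total_loss / total_tokens
```

Because adaptation only ever uses tokens that have already been scored, later chunks benefit from the updates without the reported loss depending on tokens the model has trained on.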

LR Schedule

warmdown

parameters: {"frac":0.72}
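A warmdown fraction of 0.72 can be read as: hold the learning rate constant for the first 28% of training, then decay it over the remaining 72%. The sketch below assumes a step-based schedule with linear decay to zero; the actual run may use a different decay shape or a wall-clock-based schedule.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.72):
    """Multiplier on the base learning rate at a given step (illustrative)."""
    if warmdown_frac <= 0.0:
        return 1.0  # no warmdown configured
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return 1.0
    # Assumed shape: linear decay from 1.0 at warmdown_start to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```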

Novel Contributions

QK_GAIN_INIT increased to 5.5, extending the monotonic improvement trend beyond 5.25
3-seed record result with mean val_bpb 1.0809
Combination of SP8192, 3-layer depth recurrence, parallel residuals, and legal TTT
Legal score-first test-time training under Track B constraints
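To make the role of the QK gain concrete, the sketch below assumes it is a scalar that multiplies the dot product of normalized queries and keys, so that it acts as an inverse softmax temperature. How `QK_GAIN_INIT` is actually applied in this PR (per head, learnable, RMS versus L2 normalization) is not stated in the summary, so every detail here is an assumption.

```python
import math


def attention_logits(q, keys, gain=5.5):
    """Cosine-similarity attention logits scaled by a gain (illustrative only)."""
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v)) or 1.0  # guard against zero vectors
        return [c / norm for c in v]

    qn = unit(q)
    # With unit-norm q and k, every logit lies in [-gain, gain].
    return [gain * sum(a * b for a, b in zip(qn, unit(k))) for k in keys]
```

Under this reading, a larger initial gain lets attention become sharper from the start of training, since the spread between the largest and smallest possible logit is twice the gain.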