PR #1485 (open)

Record: SP8192 + 3-Layer Depth Recurrence + Parallel Residuals + EMA + QK5 + Pre-Quant AdamW TTT — val_bpb 1.0679 (3-seed mean)

by ndokutovich
val_bpb: 1.0679
Architecture: Transformer
Optimizer: MuonEq-R
Artifact Size: ~15.95 MB

Training Techniques

Architecture
depth recurrence
3-layer recurrence with layers 3, 4, and 5 repeated to create 13 virtual layers from 11 physical layers.
parameters: {"layers":3,"physical_layers":11,"virtual_layers":13,"repeat_layers":[3,4,5]}
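The layer schedule implied by these parameters can be sketched as below. The submission does not specify where the repeated block is replayed, so placing it immediately after the block's first pass is an assumption; note also that replaying a 3-layer block gives 14 forward passes from 11 physical layers, so the stated 13 virtual layers may be counted differently.

```python
def virtual_layer_schedule(physical_layers, repeat_layers):
    """Order in which weight-shared physical layers run in the forward pass."""
    order = list(range(physical_layers))
    # Replay the repeated block once, right after its first pass
    # (placement is an assumption; the submission does not specify it).
    insert_at = order.index(repeat_layers[-1]) + 1
    return order[:insert_at] + list(repeat_layers) + order[insert_at:]

schedule = virtual_layer_schedule(11, [3, 4, 5])
```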
parallel residuals
GPT-J style parallel residual pathway starting from layer 7.
parameters: {"start_layer":7}
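In a GPT-J style block, the attention and MLP branches read the same normalized input and are summed into the residual stream in a single add, rather than running sequentially. The toy branch functions below are stand-ins for illustration, not the submission's modules:

```python
import numpy as np

def parallel_residual_block(x, attn, mlp, norm):
    # Both branches see the same normalized input; one residual add.
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, attn, mlp, norm):
    # Conventional ordering for comparison: attention first, then MLP.
    x = x + attn(norm(x))
    return x + mlp(norm(x))

x = np.arange(4.0)
center = lambda v: v - v.mean()          # stand-in "norm"
out = parallel_residual_block(x, lambda h: 2.0 * h, np.tanh, center)
```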
QK-Gain
Learnable per-head QK gain applied to Q only.
parameters: {"gain":5}
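A minimal sketch of the technique: one trainable scalar per head is multiplied into Q before the QK dot product (K is left untouched, per the description above). The initial value 5 comes from the parameters; shapes are illustrative.

```python
import numpy as np

heads, seq_len, head_dim = 8, 3, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(heads, seq_len, head_dim))
k = rng.normal(size=(heads, seq_len, head_dim))
qk_gain = np.full((heads, 1, 1), 5.0)    # one trainable scalar per head

# Gain scales Q only; K enters the dot product unchanged.
scores = (qk_gain * q) @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
```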
LeakyReLU
Squared LeakyReLU activation with slope 0.5.
parameters: {"slope":0.5,"squared":true}
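One plausible reading of "squared LeakyReLU" is a sign-preserving square of the LeakyReLU output; the submission does not specify how the negative branch is handled after squaring, so that choice is an assumption:

```python
import numpy as np

def squared_leaky_relu(x, slope=0.5):
    y = np.where(x >= 0, x, slope * x)   # standard LeakyReLU, slope 0.5
    return y * np.abs(y)                 # square the magnitude, keep the sign
```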
GQA
Grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
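With 8 query heads over 4 KV heads, each KV head is shared by heads // kv_heads = 2 query heads. A shape-level sketch (the broadcast via `np.repeat` is one common way to express the sharing):

```python
import numpy as np

heads, kv_heads, seq_len, head_dim = 8, 4, 3, 4
rng = np.random.default_rng(1)
q = rng.normal(size=(heads, seq_len, head_dim))
k = rng.normal(size=(kv_heads, seq_len, head_dim))

# Each KV head serves two consecutive query heads.
k_shared = np.repeat(k, heads // kv_heads, axis=0)   # (8, seq, dim)
scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)
```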
U-Net skip connections
Sigmoid-gated U-Net style skip connections.
parameters: null
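A minimal sketch of a sigmoid-gated skip: an earlier layer's activation is blended back in through a learned gate. The scalar gate parameter `g` is an assumption for illustration; it could equally be per-channel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(x, skip, g):
    # g is a learned gate parameter; sigmoid keeps the mix weight in (0, 1).
    return x + sigmoid(g) * skip
```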
Weight Averaging
EMA
parameters: {"decay":0.9965}
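The EMA update with the decay above is the standard one: after each optimizer step, `ema = decay * ema + (1 - decay) * weights`, and the averaged weights are what ship in the artifact.

```python
def ema_update(ema_params, params, decay=0.9965):
    # Per-parameter exponential moving average of the training weights.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

ema = [0.0]
for _ in range(3):                 # three steps toward a constant weight 1.0
    ema = ema_update(ema, [1.0])
```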
Test-Time Training
AdamW TTT
parameters: {"epochs":6,"learning_rate":0.0005,"freeze_blocks":2,"schedule":"cosine decay","pre_quant":true}
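A minimal sketch of the TTT loop: decoupled-weight-decay Adam steps under a cosine-decayed learning rate, using the stated lr=5e-4 and 6 epochs. The 1-D quadratic loss, betas, eps, and weight decay are stand-ins, and the real pass also freezes the first 2 blocks, which a scalar toy cannot show.

```python
import math

def adamw_step(w, g, m, v, t, lr, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    # Adam moment updates plus decoupled weight decay (the "W" in AdamW).
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w), m, v

def cosine_lr(step, total_steps, base_lr=5e-4):
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 7):              # 6 "epochs", one step each
    g = 2.0 * w                    # gradient of the stand-in loss w**2
    w, m, v = adamw_step(w, g, m, v, t, lr=cosine_lr(t - 1, 6))
```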
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
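GPTQ proper quantizes column-by-column using second-order (Hessian) information; the round-to-nearest symmetric quantizer below only illustrates the target grids, a signed 6-bit grid for weights and 8-bit for embeddings:

```python
import numpy as np

def quantize_symmetric(w, bits):
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
q6, s6 = quantize_symmetric(w, 6)
reconstructed = q6 * s6                   # dequantize; error is at most scale / 2
```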
Compression
brotli
level: null
Optimizer
MuonEq-R
weight_decay: null
momentum: null
other_params: {"row_normalized_newton_schulz":true}
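MuonEq-R is not a standard public optimizer name; assuming a Muon-style update, the core step is a Newton-Schulz iteration that approximately orthogonalizes the (momentum-averaged) gradient. Interpreting the `row_normalized_newton_schulz` flag as normalizing the rows of the input first is an assumption; the quintic coefficients are the ones popularized by Muon.

```python
import numpy as np

def row_normalized_newton_schulz(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315     # Muon's quintic coefficients
    X = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)  # row-normalize (assumed)
    X = X / (np.linalg.norm(X) + eps)     # pre-scale so singular values are < 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X                              # rows approximately orthonormal
```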
LR Schedule
cosine decay
parameters: null

Novel Contributions

  • First submission combining depth recurrence, parallel residuals, EMA, QK-Gain, pre-quant AdamW TTT, and SDClip GPTQ int6 in one stack.
  • 3-layer depth recurrence with layers 3, 4, and 5 repeated to expand 11 physical layers into 13 virtual layers.
  • GPT-J style parallel residuals starting from layer 7.
  • AdamW test-time training on validation data, applied before quantization so the adapted weights are baked into the final artifact.
  • SDClip GPTQ int6 with int8 embeddings and brotli compression.
  • Achieved a 3-seed mean val_bpb of 1.0679.