PR #1517
openRecord: Depth Recurrence + Banked Muon + Pre-Quant TTT (18ep) — val_bpb 1.0632 (3-seed mean)
by RulinShao
val_bpb
1.0632
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.0 MB
Training Techniques
Architecture
depth recurrence
Runs layers 3, 4, and 5 one extra time each, expanding 11 physical layers into 14 virtual layers; activated at step 2000.
parameters: {"layers":3,"start_step":2000,"physical_layers":11,"virtual_layers":14}
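A minimal sketch of the layer schedule this implies: replaying physical layers 3, 4, and 5 turns 11 physical layers into 14 virtual forward passes with zero extra parameters. The exact position of the replayed block is an assumption (here: immediately after its first pass).

```python
def depth_recurrence_schedule(physical_layers=11, reused=(3, 4, 5)):
    """Sequence of physical-layer indices executed in one forward pass."""
    order = list(range(physical_layers))
    # Assumption: the reused block runs again right after its first pass.
    insert_at = max(reused) + 1
    return order[:insert_at] + list(reused) + order[insert_at:]

print(depth_recurrence_schedule())
# [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]  -> 14 virtual layers
```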
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
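The head layout this gives: 8 query heads share 4 KV heads, so each KV head serves a group of 2 query heads, halving the KV cache relative to full multi-head attention. A sketch of the query-to-KV head mapping:

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """KV head index shared by the given query head under GQA."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head
    return query_head // group_size

print([kv_head_for(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```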
XSA
XSA enabled in all layers.
parameters: {"layers":"all"}
SmearGate
Learned token blending gate.
parameters: null
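The summary says only "learned token blending gate"; a plausible reading, sketched below under that assumption, is that each position additively mixes in a gated fraction of the previous token's representation (scalar gate `g`, first token unchanged).

```python
def smear_gate(x, g):
    """Assumed smear-gate form: out[t] = x[t] + g * x[t-1], with a learned
    scalar gate g; the first token passes through unchanged."""
    return [x[0]] + [x[t] + g * x[t - 1] for t in range(1, len(x))]

print(smear_gate([1.0, 2.0, 3.0], 0.5))  # [1.0, 2.5, 4.0]
```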
LeakyReLU
Leaky ReLU squared MLP activation.
parameters: null
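A literal reading of "leaky ReLU squared", sketched elementwise below; the exact handling of the negative branch (here: squared like the positive branch) is an assumption.

```python
def leaky_relu_squared(x, negative_slope=0.01):
    """Leaky ReLU followed by an elementwise square (assumed form)."""
    l = x if x > 0 else negative_slope * x
    return l * l

print(leaky_relu_squared(2.0))   # 4.0
print(leaky_relu_squared(-2.0))  # ~0.0004
```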
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"banked":true,"parallel":true}
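Muon's core step orthogonalizes each weight-matrix momentum update with a Newton-Schulz iteration; "banked" and "parallel" refer to how those per-matrix updates are grouped and sharded, details the summary does not spell out. A pure-Python sketch of just the orthogonalization, using the classic cubic iteration (the production optimizer uses a tuned quintic polynomial in low precision):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_schulz_orthogonalize(G, steps=12):
    """Approximate G's nearest orthogonal matrix via X <- 1.5*X - 0.5*X*X^T*X.
    Sketch of Muon's core step; Muon itself uses a tuned quintic iteration."""
    # Normalize by the Frobenius norm so singular values are <= 1,
    # which puts them in the cubic iteration's basin of convergence.
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X

X = newton_schulz_orthogonalize([[2.0, 0.0], [0.0, 1.0]])
# X is approximately the identity, the orthogonal factor of the input.
```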
Weight Averaging
EMA
parameters: {"decay":0.9965}
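The weight-averaging step with the stated decay is just an exponential moving average over the parameters, sketched below:

```python
def ema_update(avg, params, decay=0.9965):
    """One weight-averaging step: avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]

print(ema_update([0.0], [1.0]))  # ~[0.0035]
```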
Test-Time Training
full TTT
parameters: {"epochs":18,"learning_rate":0.0003,"freeze_blocks":1}
Quantization
GPTQ
bits: 6
scope: all
mixed int6/int8
bits: null
scope: all
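For orientation, a plain symmetric round-to-nearest quantizer at 6 bits is sketched below as a stand-in; GPTQ itself additionally compensates rounding error column-by-column using approximate second-order statistics, and this record applies a mixed int6/int8 variant on top.

```python
def quantize(w, bits=6):
    """Symmetric per-tensor round-to-nearest quantization (a stand-in for
    GPTQ, which also corrects rounding error with second-order info)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    scale = max(abs(v) for v in w) / qmax
    return [round(v / scale) for v in w], scale

def dequantize(q, scale):
    return [v * scale for v in q]

q, s = quantize([1.0, -0.5, 0.25])
w_hat = dequantize(q, s)  # each entry within half a quantization step
```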
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
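The stated formula gives each layer's LayerNorm scale directly; sketched below under the assumption that layers are 0-indexed, so the scale shrinks with depth:

```python
import math

def ln_scale(layer):
    """LN scale from the stated formula 1/sqrt(layer + 1)
    (assumption: 0-indexed layers)."""
    return 1.0 / math.sqrt(layer + 1)

print(ln_scale(0), ln_scale(3))  # 1.0 0.5
```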
LR Schedule
cosine decay
parameters: {"ttt":true}
warmdown
parameters: {"frac":0.72}
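A common reading of `warmdown` with `frac: 0.72`, sketched below as an assumption: the learning-rate multiplier stays flat, then decays linearly to zero over the final 72% of training (how this composes with the cosine decay listed above is not specified in the summary).

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.72):
    """Assumed trapezoid-style schedule: flat, then linear decay to zero
    over the final warmdown_frac of training."""
    warmdown_start = total_steps * (1 - warmdown_frac)
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))

print(lr_multiplier(0, 1000), lr_multiplier(1000, 1000))  # 1.0 0.0
```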
Compression
brotli
level: null
Novel Contributions
- Depth recurrence integrated into a parameter-banked Parallel Muon architecture
- Reuse of layers 3, 4, and 5 to create 14 virtual layers from 11 physical layers with zero extra parameters
- Test-time training with AdamW for 18 epochs, applied before quantization
- Combination of depth recurrence, banked Muon, and SDClip GPTQ quantization into a single record submission