PR #1485 (open)

Record: SP8192 + 3-Layer Depth Recurrence + Parallel Residuals + EMA + QK5 + Pre-Quant AdamW TTT — val_bpb 1.0679 (3-seed mean)

by ndokutovich
val_bpb: 1.0679
Architecture: Transformer
Optimizer: MuonEq-R
Artifact Size: ~15.95 MB

Training Techniques

Architecture
depth recurrence
3-layer recurrence with layers 3, 4, and 5 repeated to create 13 virtual layers from 11 physical layers.
parameters: {"layers":3,"physical_layers":11,"virtual_layers":13,"repeat_layers":[3,4,5]}
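The layer schedule implied by these parameters can be sketched as below. The submission does not specify where the repeated block is replayed, so placing it immediately after the block's first pass is an assumption; note also that replaying a 3-layer block gives 14 forward passes from 11 physical layers, so the stated 13 virtual layers may be counted differently.

```python
def virtual_layer_schedule(physical_layers, repeat_layers):
    """Order in which weight-shared physical layers run in the forward pass."""
    order = list(range(physical_layers))
    # Replay the repeated block once, right after its first pass
    # (placement is an assumption; the submission does not specify it).
    insert_at = order.index(repeat_layers[-1]) + 1
    return order[:insert_at] + list(repeat_layers) + order[insert_at:]

schedule = virtual_layer_schedule(11, [3, 4, 5])
```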
parallel residuals
GPT-J style parallel residual pathway starting from layer 7.
parameters: {"start_layer":7}
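In a GPT-J style block, the attention and MLP branches read the same normalized input and are summed into the residual stream in a single add, rather than running sequentially. The toy branch functions below are stand-ins for illustration, not the submission's modules:

```python
import numpy as np

def parallel_residual_block(x, attn, mlp, norm):
    # Both branches see the same normalized input; one residual add.
    h = norm(x)
    return x + attn(h) + mlp(h)

def sequential_block(x, attn, mlp, norm):
    # Conventional ordering for comparison: attention first, then MLP.
    x = x + attn(norm(x))
    return x + mlp(norm(x))

x = np.arange(4.0)
center = lambda v: v - v.mean()          # stand-in "norm"
out = parallel_residual_block(x, lambda h: 2.0 * h, np.tanh, center)
```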
QK-Gain
Learnable per-head QK gain applied to Q only.
parameters: {"gain":5}
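A minimal sketch of the technique: one trainable scalar per head is multiplied into Q before the QK dot product (K is left untouched, per the description above). The initial value 5 comes from the parameters; shapes are illustrative.

```python
import numpy as np

heads, seq_len, head_dim = 8, 3, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(heads, seq_len, head_dim))
k = rng.normal(size=(heads, seq_len, head_dim))
qk_gain = np.full((heads, 1, 1), 5.0)    # one trainable scalar per head

# Gain scales Q only; K enters the dot product unchanged.
scores = (qk_gain * q) @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
```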
LeakyReLU
Squared LeakyReLU activation with slope 0.5.
parameters: {"slope":0.5,"squared":true}
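One plausible reading of "squared LeakyReLU" is a sign-preserving square of the LeakyReLU output; the submission does not specify how the negative branch is handled after squaring, so that choice is an assumption:

```python
import numpy as np

def squared_leaky_relu(x, slope=0.5):
    y = np.where(x >= 0, x, slope * x)   # standard LeakyReLU, slope 0.5
    return y * np.abs(y)                 # square the magnitude, keep the sign
```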
GQA
Grouped query attention with 8 heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
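With 8 query heads over 4 KV heads, each KV head is shared by heads // kv_heads = 2 query heads. A shape-level sketch (the broadcast via `np.repeat` is one common way to express the sharing):

```python
import numpy as np

heads, kv_heads, seq_len, head_dim = 8, 4, 3, 4
rng = np.random.default_rng(1)
q = rng.normal(size=(heads, seq_len, head_dim))
k = rng.normal(size=(kv_heads, seq_len, head_dim))

# Each KV head serves two consecutive query heads.
k_shared = np.repeat(k, heads // kv_heads, axis=0)   # (8, seq, dim)
scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)
```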
U-Net skip connections
Sigmoid-gated U-Net style skip connections.
parameters: null
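A minimal sketch of a sigmoid-gated skip: an earlier layer's activation is blended back in through a learned gate. The scalar gate parameter `g` is an assumption for illustration; it could equally be per-channel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(x, skip, g):
    # g is a learned gate parameter; sigmoid keeps the mix weight in (0, 1).
    return x + sigmoid(g) * skip
```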
Weight Averaging
EMA
parameters: {"decay":0.9965}
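The EMA update with the decay above is the standard one: after each optimizer step, `ema = decay * ema + (1 - decay) * weights`, and the averaged weights are what ship in the artifact.

```python
def ema_update(ema_params, params, decay=0.9965):
    # Per-parameter exponential moving average of the training weights.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

ema = [0.0]
for _ in range(3):                 # three steps toward a constant weight 1.0
    ema = ema_update(ema, [1.0])
```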
Test-Time Training
AdamW TTT
parameters: {"epochs":6,"learning_rate":0.0005,"freeze_blocks":2,"schedule":"cosine decay","pre_quant":true}
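A minimal sketch of the TTT loop: decoupled-weight-decay Adam steps under a cosine-decayed learning rate, using the stated lr=5e-4 and 6 epochs. The 1-D quadratic loss, betas, eps, and weight decay are stand-ins, and the real pass also freezes the first 2 blocks, which a scalar toy cannot show.

```python
import math

def adamw_step(w, g, m, v, t, lr, betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    # Adam moment updates plus decoupled weight decay (the "W" in AdamW).
    m = betas[0] * m + (1 - betas[0]) * g
    v = betas[1] * v + (1 - betas[1]) * g * g
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w), m, v

def cosine_lr(step, total_steps, base_lr=5e-4):
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 7):              # 6 "epochs", one step each
    g = 2.0 * w                    # gradient of the stand-in loss w**2
    w, m, v = adamw_step(w, g, m, v, t, lr=cosine_lr(t - 1, 6))
```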
Quantization
GPTQ
bits: 6
scope: weights
int8
bits: 8
scope: embeddings
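GPTQ proper quantizes column-by-column using second-order (Hessian) information; the round-to-nearest symmetric quantizer below only illustrates the target grids, a signed 6-bit grid for weights and 8-bit for embeddings:

```python
import numpy as np

def quantize_symmetric(w, bits):
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
q6, s6 = quantize_symmetric(w, 6)
reconstructed = q6 * s6                   # dequantize; error is at most scale / 2
```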
Compression
brotli
level: null
Optimizer
MuonEq-R
weight_decay: null
momentum: null
other_params: {"row_normalized_newton_schulz":true}
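MuonEq-R is not a standard public optimizer name; assuming a Muon-style update, the core step is a Newton-Schulz iteration that approximately orthogonalizes the (momentum-averaged) gradient. Interpreting the `row_normalized_newton_schulz` flag as normalizing the rows of the input first is an assumption; the quintic coefficients are the ones popularized by Muon.

```python
import numpy as np

def row_normalized_newton_schulz(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315     # Muon's quintic coefficients
    X = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)  # row-normalize (assumed)
    X = X / (np.linalg.norm(X) + eps)     # pre-scale so singular values are < 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X                              # rows approximately orthonormal
```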
LR Schedule
cosine decay
parameters: null

Novel Contributions

  • First submission combining depth recurrence, parallel residuals, EMA, QK-Gain, pre-quant AdamW TTT, and SDClip GPTQ int6 in one stack.
  • 3-layer depth recurrence with layers 3, 4, and 5 repeated to expand 11 physical layers into 13 virtual layers.
  • GPT-J style parallel residuals starting from layer 7.
  • AdamW test-time training on validation data, applied before quantization so the adapted weights are baked into the final artifact.
  • SDClip GPTQ int6 with int8 embeddings and brotli compression.
  • Achieved a 3-seed mean val_bpb of 1.0679.