PR #1517
openRecord: Depth Recurrence + Banked Muon + Pre-Quant TTT (18ep) — val_bpb 1.0632 (3-seed mean)
by RulinShao
val_bpb
1.0632
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.0 MB
Training Techniques
Architecture
depth recurrence
Runs layers 3, 4, and 5 one extra time each, expanding 11 physical layers into 14 virtual layers; activated at step 2000.
parameters: {"layers":3,"start_step":2000,"physical_layers":11,"virtual_layers":14}
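A minimal sketch of the layer schedule this implies: replaying physical layers 3, 4, and 5 turns 11 physical layers into 14 virtual forward passes with zero extra parameters. The exact position of the replayed block is an assumption (here: immediately after its first pass).

```python
def depth_recurrence_schedule(physical_layers=11, reused=(3, 4, 5)):
    """Sequence of physical-layer indices executed in one forward pass."""
    order = list(range(physical_layers))
    # Assumption: the reused block runs again right after its first pass.
    insert_at = max(reused) + 1
    return order[:insert_at] + list(reused) + order[insert_at:]

print(depth_recurrence_schedule())
# [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]  -> 14 virtual layers
```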
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
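The head layout this gives: 8 query heads share 4 KV heads, so each KV head serves a group of 2 query heads, halving the KV cache relative to full multi-head attention. A sketch of the query-to-KV head mapping:

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """KV head index shared by the given query head under GQA."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head
    return query_head // group_size

print([kv_head_for(h) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```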
XSA
XSA enabled in all layers.
parameters: {"layers":"all"}
SmearGate
Learned token blending gate.
parameters: null
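The summary says only "learned token blending gate"; a plausible reading, sketched below under that assumption, is that each position additively mixes in a gated fraction of the previous token's representation (scalar gate `g`, first token unchanged).

```python
def smear_gate(x, g):
    """Assumed smear-gate form: out[t] = x[t] + g * x[t-1], with a learned
    scalar gate g; the first token passes through unchanged."""
    return [x[0]] + [x[t] + g * x[t - 1] for t in range(1, len(x))]

print(smear_gate([1.0, 2.0, 3.0], 0.5))  # [1.0, 2.5, 4.0]
```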
LeakyReLU
Leaky ReLU squared MLP activation.
parameters: null
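A literal reading of "leaky ReLU squared", sketched elementwise below; the exact handling of the negative branch (here: squared like the positive branch) is an assumption.

```python
def leaky_relu_squared(x, negative_slope=0.01):
    """Leaky ReLU followed by an elementwise square (assumed form)."""
    l = x if x > 0 else negative_slope * x
    return l * l

print(leaky_relu_squared(2.0))   # 4.0
print(leaky_relu_squared(-2.0))  # ~0.0004
```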
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"banked":true,"parallel":true}
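Muon's core step orthogonalizes each weight-matrix momentum update with a Newton-Schulz iteration; "banked" and "parallel" refer to how those per-matrix updates are grouped and sharded, details the summary does not spell out. A pure-Python sketch of just the orthogonalization, using the classic cubic iteration (the production optimizer uses a tuned quintic polynomial in low precision):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_schulz_orthogonalize(G, steps=12):
    """Approximate G's nearest orthogonal matrix via X <- 1.5*X - 0.5*X*X^T*X.
    Sketch of Muon's core step; Muon itself uses a tuned quintic iteration."""
    # Normalize by the Frobenius norm so singular values are <= 1,
    # which puts them in the cubic iteration's basin of convergence.
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X

X = newton_schulz_orthogonalize([[2.0, 0.0], [0.0, 1.0]])
# X is approximately the identity, the orthogonal factor of the input.
```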
Weight Averaging
EMA
parameters: {"decay":0.9965}
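The weight-averaging step with the stated decay is just an exponential moving average over the parameters, sketched below:

```python
def ema_update(avg, params, decay=0.9965):
    """One weight-averaging step: avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]

print(ema_update([0.0], [1.0]))  # ~[0.0035]
```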
Test-Time Training
full TTT
parameters: {"epochs":18,"learning_rate":0.0003,"freeze_blocks":1}
Quantization
GPTQ
bits: 6
scope: all
mixed int6/int8
bits: null
scope: all
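For orientation, a plain symmetric round-to-nearest quantizer at 6 bits is sketched below as a stand-in; GPTQ itself additionally compensates rounding error column-by-column using approximate second-order statistics, and this record applies a mixed int6/int8 variant on top.

```python
def quantize(w, bits=6):
    """Symmetric per-tensor round-to-nearest quantization (a stand-in for
    GPTQ, which also corrects rounding error with second-order info)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6
    scale = max(abs(v) for v in w) / qmax
    return [round(v / scale) for v in w], scale

def dequantize(q, scale):
    return [v * scale for v in q]

q, s = quantize([1.0, -0.5, 0.25])
w_hat = dequantize(q, s)  # each entry within half a quantization step
```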
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
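The stated formula gives each layer's LayerNorm scale directly; sketched below under the assumption that layers are 0-indexed, so the scale shrinks with depth:

```python
import math

def ln_scale(layer):
    """LN scale from the stated formula 1/sqrt(layer + 1)
    (assumption: 0-indexed layers)."""
    return 1.0 / math.sqrt(layer + 1)

print(ln_scale(0), ln_scale(3))  # 1.0 0.5
```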
LR Schedule
cosine decay
parameters: {"ttt":true}
warmdown
parameters: {"frac":0.72}
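A common reading of `warmdown` with `frac: 0.72`, sketched below as an assumption: the learning-rate multiplier stays flat, then decays linearly to zero over the final 72% of training (how this composes with the cosine decay listed above is not specified in the summary).

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.72):
    """Assumed trapezoid-style schedule: flat, then linear decay to zero
    over the final warmdown_frac of training."""
    warmdown_start = total_steps * (1 - warmdown_frac)
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))

print(lr_multiplier(0, 1000), lr_multiplier(1000, 1000))  # 1.0 0.0
```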
Compression
brotli
level: null
Novel Contributions
- Depth recurrence integrated into a parameter-banked Parallel Muon architecture
- Reuse of layers 3, 4, and 5 to create 14 virtual layers from 11 physical layers with zero extra parameters
- Test-time training with AdamW for 18 epochs, applied before quantization
- Combination of depth recurrence, banked Muon, and SDClip GPTQ quantization into a single record submission