PR #686
Record: Depth Recurrence (layers 4 and 5 repeated): val_bpb 1.1182
by msisovic
val_bpb: 1.1182
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9MB
Training Techniques
Architecture: depth recurrence
Re-executes mid-network layers (4 and 5), each repeated pass gated by its own learnable block scalar, adding virtual depth with almost no growth in model size.
parameters: {"recur_layers":[4,5],"physical_layers":11,"virtual_layers":13}
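A minimal sketch of the idea, assuming an ordered stack of residual blocks; the class name, gating placement, and residual form are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Re-execute selected blocks a second time, each extra (virtual)
    pass gated by its own learnable scalar, so 11 physical layers can
    act as 13 virtual layers for ~1 extra parameter per repeated layer.
    """
    def __init__(self, blocks, recur_layers):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.recur_layers = set(recur_layers)
        # one independent learnable scalar per repeated pass
        self.scalars = nn.ParameterDict({
            str(i): nn.Parameter(torch.ones(1)) for i in recur_layers
        })

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            x = x + block(x)                              # physical pass
            if i in self.recur_layers:
                x = x + self.scalars[str(i)] * block(x)   # virtual pass
        return x

# 11 physical layers, layers 4 and 5 repeated -> 13 virtual layers
blocks = [nn.Linear(8, 8) for _ in range(11)]
stack = DepthRecurrentStack(blocks, recur_layers=[4, 5])
out = stack(torch.randn(2, 8))
```

Because each virtual pass reuses the physical block's weights, the only new parameters are the two scalars.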
Quantization: int6
bits: 6
scope: all
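A generic sketch of symmetric 6-bit quantization, per-tensor for simplicity; the PR's exact scheme (per-channel vs. per-tensor scales, rounding mode, storage packing) is not recorded here and may differ:

```python
import torch

def quantize_int6(w, eps=1e-12):
    """Symmetric quantization to signed 6-bit codes in [-31, 31].

    Returns the integer codes plus the scale needed to dequantize.
    """
    qmax = 2 ** (6 - 1) - 1                      # 31 for 6 bits
    scale = w.abs().max().clamp_min(eps) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4, 4)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

With scope "all", every weight tensor in the ~15.9MB artifact would be stored this way, at 6 bits (plus scales) per weight.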
Optimizer: Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_from":0.92,"warmup_steps":1500}
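The warmup parameters suggest Muon's momentum is ramped from 0.92 to the final 0.99 over the first 1500 steps. A sketch, assuming linear interpolation (the PR records only the endpoints and step count):

```python
def muon_momentum(step, warmup_from=0.92, target=0.99, warmup_steps=1500):
    """Warm Muon's momentum linearly from 0.92 to 0.99, then hold it.

    The linear shape is an assumption; only the endpoints are recorded.
    """
    if step >= warmup_steps:
        return target
    return warmup_from + (step / warmup_steps) * (target - warmup_from)
```

The returned value would be written into the optimizer's momentum setting each step before calling it.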
Weight Averaging: SWA
parameters: {"every":50}
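A sketch of stochastic weight averaging with snapshots every 50 steps; the class and update rule are illustrative, and how the averaged weights are swapped in for evaluation is not shown:

```python
import torch

class SWAAverager:
    """Keep a running equal-weight average of parameter snapshots
    taken every `every` optimizer steps (the PR records every=50).
    """
    def __init__(self, every=50):
        self.every = every
        self.n = 0
        self.avg = None

    def maybe_update(self, step, params):
        if step % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.detach().clone() for k, v in params.items()}
        else:
            for k, v in params.items():
                # incremental mean: avg += (x - avg) / n
                self.avg[k] += (v.detach() - self.avg[k]) / self.n

swa = SWAAverager(every=50)
swa.maybe_update(50, {"w": torch.tensor([1.0])})
swa.maybe_update(75, {"w": torch.tensor([9.0])})   # skipped: not a multiple of 50
swa.maybe_update(100, {"w": torch.tensor([3.0])})
```

The incremental-mean form avoids storing all snapshots while giving each an equal weight.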
Evaluation: stride-based eval
parameters: {"stride":64}
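Strided evaluation slides a full-length context window forward 64 tokens at a time and scores only the tokens not covered by the previous window, so each token gets near-maximal left context. A window-planning sketch; pairing ctx_len with the 2048 train length is an assumption:

```python
def stride_eval_windows(n_tokens, ctx_len=2048, stride=64):
    """Plan (begin, end, n_scored) windows so every token is scored
    exactly once, with up to ctx_len - stride tokens of extra context.
    """
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx_len, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

windows = stride_eval_windows(300, ctx_len=128, stride=64)
```

Each planned window would be run through the model, accumulating loss only over its last `n_scored` positions.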
Test-Time Training: full TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2,"untie":false}
Sequence Length
train_length: 2048
eval_length: null
LR Schedule: warmdown
parameters: {"warmdown_steps":3500}
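A warmdown schedule holds the base learning rate and then decays it over the final 3500 steps. A sketch assuming a linear decay to zero, the common shape for this schedule name; the PR records only the step count:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Hold base_lr, then decay linearly to 0 over the last
    warmdown_steps of training. The linear shape is an assumption.
    """
    if step < total_steps - warmdown_steps:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

Each training step would look up its learning rate from this function before the optimizer update.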
Regularization: weight decay
parameters: {"matrix_weight_decay":0.04,"adam_weight_decay":0.04}
Other
Uses independent learnable block scalars for recurrent layer passes.
parameters: {"added_params":"~2K"}
Novel Contributions
- Dual depth recurrence on layers 4 and 5 to create 13 virtual layers from 11 physical layers
- Independent learnable block scalars for repeated layer passes
- Achieves gains close to those of adding independent layers while staying under the artifact size budget
- Confirms tied TTT performs equivalently to untied for recurrent layers