PR #341

open

Add Hybrid Depth-Recurrent Transformer submission

by tobiascanavesi
val_bpb
1.3323
Architecture
Hybrid Depth-Recurrent Transformer
Optimizer
Muon
Artifact Size
14.2 MB
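For readers unfamiliar with the metric: assuming val_bpb denotes validation bits per byte, it can be derived from the mean cross-entropy loss per token. The sketch below shows the standard conversion; the function and argument names are illustrative and not taken from the submission.

```python
import math

def bits_per_byte(mean_loss_nats, num_tokens, num_bytes):
    """Convert mean per-token cross-entropy (in nats) to bits per byte."""
    bits_per_token = mean_loss_nats / math.log(2)   # nats -> bits
    tokens_per_byte = num_tokens / num_bytes        # tokenizer compression
    return bits_per_token * tokens_per_byte

# Toy check: a loss of 2 bits per token, with 4 bytes per token on average.
example_bpb = bits_per_byte(2 * math.log(2), num_tokens=1, num_bytes=4)
```

Lower is better; the value depends on both the model's loss and how many bytes each token covers.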

Training Techniques

Quantization
int8
bits: 8
scope: model weights with FP16 tied embedding passthrough
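A minimal sketch of what this scope could look like, not the submission's code. It assumes symmetric per-row int8 scaling and a hypothetical tensor name `tok_emb.weight` for the tied embedding, which is stored in FP16 instead of being quantized.

```python
import struct

def quantize_int8(row):
    """Symmetric int8 quantization of one row: returns (codes, scale)."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in row]
    return codes, scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]

def to_fp16(row):
    """Round-trip values through IEEE half precision (struct format 'e')."""
    return [struct.unpack("e", struct.pack("e", v))[0] for v in row]

def export_state(state, passthrough=("tok_emb.weight",)):
    """Quantize every tensor to int8 except the names listed in passthrough."""
    out = {}
    for name, rows in state.items():
        if name in passthrough:
            out[name] = ("fp16", [to_fp16(r) for r in rows])
        else:
            out[name] = ("int8", [quantize_int8(r) for r in rows])
    return out

# Toy state dict; tensor names are hypothetical.
state = {
    "tok_emb.weight": [[0.5, -0.25], [0.125, 1.0]],
    "blocks.0.mlp.weight": [[0.1, -0.2, 0.3], [1.0, 0.0, -1.0]],
}
exported = export_state(state)
```

Because a tied embedding serves as both the input lookup and the output projection, keeping it out of the int8 path avoids paying its rounding error twice.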
Architecture
depth recurrence
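Hybrid depth-recurrent transformer: 1 unique entry layer, 4 shared blocks looped 5 times, and 1 unique exit layer (22 effective layers). Keeping the entry and exit layers unique is intended to limit the compounding of int8 quantization error across loops.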
parameters: {"unique_entry_layers":1,"shared_blocks":4,"loops":5,"unique_exit_layers":1,"effective_layers":22}
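The parameters above imply a schedule of virtual layers that can be written out explicitly. The sketch below assumes the four shared blocks run in order within each loop (block 0 to 3, repeated 5 times); the source does not state the loop ordering, and the function name is illustrative.

```python
def virtual_layer_schedule(unique_entry_layers=1, shared_blocks=4,
                           loops=5, unique_exit_layers=1):
    """Return the list of (kind, index) blocks visited in one forward pass."""
    schedule = [("entry", i) for i in range(unique_entry_layers)]
    for _ in range(loops):
        # The same shared blocks (same weights) are revisited on every loop.
        schedule += [("shared", b) for b in range(shared_blocks)]
    schedule += [("exit", i) for i in range(unique_exit_layers)]
    return schedule

schedule = virtual_layer_schedule()
effective_layers = len(schedule)       # 1 + 4 * 5 + 1
unique_blocks = len(set(schedule))     # 1 + 4 + 1
```

Only the unique blocks contribute stored weights, which is where the artifact-size saving of depth recurrence comes from.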
U-Net skip connections
U-Net-style skip connections spanning the full effective depth (all 22 virtual layers).
parameters: null
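One common way to realize U-Net skips in a transformer stack is to treat the first half of the layers as an encoder that saves its outputs and the second half as a decoder that adds them back in reverse order. The sketch below assumes that pairing and uses scalar activations for brevity; the submission's actual wiring is not specified here, and `skip_weights` is a hypothetical name.

```python
def unet_skip_pairs(num_layers):
    """Pair encoder layer i with the decoder layer that consumes its output."""
    n_enc = num_layers // 2
    n_dec = num_layers - n_enc
    return [(n_enc - 1 - i, n_enc + i) for i in range(min(n_enc, n_dec))]

def unet_forward(x, layers, skip_weights):
    """Run layers with last-in, first-out skip connections."""
    n_enc = len(layers) // 2
    skips = []
    for layer in layers[:n_enc]:
        x = layer(x)
        skips.append(x)                    # save encoder output
    for i, layer in enumerate(layers[n_enc:]):
        if skips:
            x = x + skip_weights[i] * skips.pop()   # add the matching skip
        x = layer(x)
    return x

pairs = unet_skip_pairs(22)                # 11 encoder/decoder pairs
toy_out = unet_forward(0.0, [lambda v: v + 1.0] * 4, [0.5, 0.5])
```

With 22 virtual layers, the innermost pair connects layers 10 and 11 and the outermost connects layers 0 and 21.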
per-layer scalars
Per-virtual-layer scalars controlling attention, MLP, residual mixing, and quantization gain.
parameters: {"scalars":["attn_scale","mlp_scale","resid_mix","q_gain"]}
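The point of per-virtual-layer scalars is that a shared block can behave differently on each visit even though its weight matrices are identical. The sketch below illustrates that idea with scalar activations; how each scalar enters the computation is an assumption, and `q_gain` is stored but left unused because its exact role is not described here.

```python
SCALAR_NAMES = ("attn_scale", "mlp_scale", "resid_mix", "q_gain")

def init_scalars(effective_layers=22):
    """One independent set of scalars per virtual layer, not per unique block."""
    return [{name: 1.0 for name in SCALAR_NAMES} for _ in range(effective_layers)]

def virtual_layer(x, x0, block, s):
    """Apply a (possibly shared) block using this virtual layer's scalars."""
    # Assumed form: resid_mix blends the running stream with the input embedding x0.
    h = s["resid_mix"] * x + (1.0 - s["resid_mix"]) * x0
    h = h + s["attn_scale"] * block["attn"](h)
    h = h + s["mlp_scale"] * block["mlp"](h)
    return h

scalars = init_scalars()
toy_block = {"attn": lambda h: 2.0, "mlp": lambda h: 3.0}   # stand-in sublayers
toy_s = {"attn_scale": 0.5, "mlp_scale": 2.0, "resid_mix": 0.25, "q_gain": 1.0}
toy_out = virtual_layer(4.0, 0.0, toy_block, toy_s)
```

The scalars add only a handful of parameters per virtual layer, so they cost almost nothing in artifact size.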
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"matrix_lr":0.03,"scalar_lr":0.03,"tied_embed_lr":0.04}
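Muon's defining step is to replace each 2-D weight update with an approximately orthogonalized version computed by a Newton-Schulz iteration. The NumPy sketch below shows that step with the coefficients used in the public Muon reference implementation; how this submission connects it to the learning rates and weight decay listed above is not shown.

```python
import numpy as np

def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)      # Frobenius norm bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                            # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 16))           # stand-in for a momentum buffer
O = newton_schulz5(G)
singular_values = np.linalg.svd(O, compute_uv=False)
```

The iteration does not converge to exactly 1; the singular values end up in a band around 1, which is what the optimizer relies on.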
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":1024}
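With a stride of 64 and a window of 1024, each window after the first scores only its final 64 tokens, so every scored token sees up to 960 tokens of context. A sketch of the window bookkeeping, assuming each token is scored exactly once; the function name and tuple layout are illustrative.

```python
def sliding_windows(num_tokens, seq_len=1024, stride=64):
    """Return (start, end, score_from) triples; tokens in [score_from, end) are scored."""
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + seq_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end                  # everything up to `end` is now scored
        if end == num_tokens:
            break
        start += stride
    return windows

windows = sliding_windows(2048)
total_scored = sum(end - score_from for _, end, score_from in windows)
```

The cost is roughly seq_len / stride times more forward compute than non-overlapping evaluation, traded for a lower and more representative loss.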
Initialization
spectral init
Overtone spectral embedding initialization using SVD power-law spectrum shaping.
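One plausible reading of power-law spectrum shaping is: draw a random matrix, take its SVD, and replace the singular values with a power-law profile before reconstructing. The NumPy sketch below follows that reading; the exponent `alpha`, the target `std`, and the function name are assumptions, not values from the submission.

```python
import numpy as np

def power_law_spectral_init(vocab_size, dim, alpha=0.5, std=0.02, seed=0):
    """Random embedding whose k-th singular value is proportional to k**(-alpha)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((vocab_size, dim))
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = np.arange(1, S.shape[0] + 1, dtype=np.float64)
    target = k ** (-alpha)                 # power-law spectrum
    # Rescale so the RMS of the matrix entries equals `std`.
    target *= std * np.sqrt(vocab_size * dim) / np.linalg.norm(target)
    return (U * target) @ Vt               # keep random singular vectors

E = power_law_spectral_init(64, 16)
spectrum = np.linalg.svd(E, compute_uv=False)
```

The singular vectors stay random; only the distribution of energy across directions is shaped.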
resid mix
Phase-transition residual mixing initialization with sigmoid-scheduled resid_mix.
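A sigmoid schedule over depth gives each virtual layer a different initial resid_mix, with a sharp transition around a chosen layer. The sketch below assumes the value rises with depth and centers the transition at mid-depth; the direction, `midpoint`, and `sharpness` are hypothetical and not stated in the source.

```python
import math

def sigmoid_resid_mix_init(effective_layers=22, midpoint=None, sharpness=0.5):
    """Initial resid_mix per virtual layer, following a sigmoid in layer index."""
    if midpoint is None:
        midpoint = (effective_layers - 1) / 2.0
    return [
        1.0 / (1.0 + math.exp(-sharpness * (i - midpoint)))
        for i in range(effective_layers)
    ]

resid_mix_init = sigmoid_resid_mix_init()
```

Larger `sharpness` values make the change between the two regimes more abrupt, which is the sense in which the schedule resembles a phase transition.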
Sequence Length
sequence_length
train_length: null
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":2500}
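A warmdown schedule holds the learning rate constant and then decays it over the final iterations. The sketch below assumes a linear decay to zero over the last 2500 steps; the decay shape and the total step count in the example are assumptions, since only warmdown_iters is given.

```python
def lr_multiplier(step, total_iters, warmdown_iters=2500):
    """Multiplier applied to the base learning rate at a given step."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return 1.0                         # constant phase
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_iters.
    return max(0.0, (total_iters - step) / warmdown_iters)

# Example with a hypothetical run length of 10000 steps.
halfway_through_warmdown = lr_multiplier(8750, total_iters=10000)
```

The same multiplier would typically scale every parameter group, so the ratios between the listed learning rates are preserved throughout.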
Regularization
weight decay
parameters: {"value":0.02}
Other
other
The tied embedding is stored in FP16 and passed through int8 quantization unchanged.
parameters: null

Novel Contributions

  • Hybrid depth-recurrent transformer that keeps entry and exit layers unique while sharing only middle blocks
  • Reduction of int8 quantization error compounding in depth recurrence
  • Near-zero quantization gap compared with pure depth recurrence
  • U-Net skip connections across the full effective depth
  • Per-virtual-layer scalar controls for attention, MLP, residual mixing, and quantization gain
  • FP16 tied embedding passthrough during int8 quantization
  • Overtone spectral embedding initialization
  • Phase-transition residual mixing initialization