PR #319

open

Non-record: Depth Recurrence 5x3 — Weight-Shared Looping Transformer (6xH200, val_bpb=1.2716)

by Arth-Singh
val_bpb
1.2716
Architecture
Transformer
Optimizer
Muon + Adam
Artifact Size
15M params

Training Techniques

Architecture
depth recurrence
Uses 5 unique transformer layers looped 3 times to create 15 effective layers while sharing weights across loops.
parameters: {"unique_layers":5,"loops":3,"effective_depth":15,"dim":640}
weight tying
Shares the same transformer block weights across repeated loop passes.
parameters: {"unique_layers":5,"loops":3}
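The two techniques above are two views of the same mechanism: a small stack of layers is reused across loop passes. A minimal dependency-free sketch, with simple functions standing in for real transformer blocks (all names hypothetical):

```python
calls = {"n": 0}  # counts layer applications to verify effective depth

def make_layer(scale):
    # stand-in for a transformer block; a real block would do attention + MLP
    def layer(x):
        calls["n"] += 1
        return [scale * v + 1.0 for v in x]
    return layer

# 5 unique layers, each applied in every loop (weights shared across loops)
unique_layers = [make_layer(s) for s in (0.9, 0.95, 1.0, 1.05, 1.1)]
loops = 3

def forward(x):
    for _ in range(loops):           # 3 passes over the same weights
        for layer in unique_layers:  # 5 unique layers per pass
            x = layer(x)
    return x                         # 5 * 3 = 15 effective layers

out = forward([0.0, 1.0])
```

Parameter count stays at 5 layers' worth of weights while compute and effective depth scale with the loop count.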
loop embeddings
Adds learnable per-loop vectors to the residual stream so the model can distinguish different passes through the shared layers.
parameters: {"num_loops":3}
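A sketch of the loop-embedding idea: one learnable vector per loop, added to the residual stream so shared weights can tell the passes apart. The placement (start of each loop) and toy dimension are assumptions; the PR uses dim=640.

```python
dim, num_loops = 4, 3  # toy dim for illustration; the PR uses dim=640
# one learnable vector per loop, zero-initialized per the PR
loop_embeddings = [[0.0] * dim for _ in range(num_loops)]

def add_loop_embedding(x, loop_idx):
    # added to the residual stream at the start of each pass (placement assumed)
    return [xi + ei for xi, ei in zip(x, loop_embeddings[loop_idx])]

x0 = [1.0, 2.0, 3.0, 4.0]
x_after = add_loop_embedding(x0, 0)  # zero init: loop 0 is unchanged at step 0
```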
loop gates
Uses learnable per-loop scalars to mix loop output with the initial representation x0; noted as over-regularized.
parameters: {"num_loops":3,"initial_gate":0.3333333333333333}
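A possible form of the per-loop gating, assuming a convex combination of the loop output and the initial representation x0 (the exact mixing formula is not spelled out in the PR):

```python
num_loops = 3
# uniform init: each gate starts at 1/num_loops ≈ 0.333 (flagged as over-regularizing)
gates = [1.0 / num_loops] * num_loops

def mix(x_loop, x0, g):
    # convex combination of loop output and initial representation (form assumed)
    return [g * a + (1.0 - g) * b for a, b in zip(x_loop, x0)]

mixed = mix([3.0, 3.0], [0.0, 0.0], gates[0])  # pulls output 2/3 back toward x0
```

With g = 1/3 the loop output is heavily diluted toward x0, which is consistent with the author's note that this gating over-regularizes.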
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
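With 8 query heads and 4 KV heads, grouped-query attention shares each KV head between 2 query heads. The standard head-to-group mapping can be sketched as:

```python
heads, kv_heads = 8, 4
group_size = heads // kv_heads  # 2 query heads share each KV head

def kv_head_for(query_head):
    # grouped-query attention: consecutive query heads map to one KV head
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(heads)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV heads halves KV-cache size and the K/V projection parameters relative to full multi-head attention.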
Optimizer
Muon + Adam
weight_decay: null
momentum: null
other_params: {"split_optimizer":true}
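The `split_optimizer` flag suggests parameters are partitioned between the two optimizers. A hypothetical routing rule following the common speedrun convention (2-D weight matrices to Muon; embeddings, gains, and scalars to Adam) — the PR does not spell out the actual split:

```python
def route(name, shape):
    # convention assumed: Muon needs 2-D matrices; everything else goes to Adam
    if len(shape) == 2 and "emb" not in name:
        return "muon"
    return "adam"

# hypothetical parameter names and shapes for illustration
assignments = {
    "attn.qkv_w":  route("attn.qkv_w", (640, 1920)),   # matrix -> muon
    "token_emb":   route("token_emb", (50304, 640)),   # embedding -> adam
    "loop_gate.0": route("loop_gate.0", ()),           # scalar -> adam
}
```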
Initialization
zero initialization for loop embeddings
Loop embeddings were initialized to zero so the first loop behaves like a vanilla (non-recurrent) forward pass.
uniform gate initialization
Loop gates were initialized uniformly to 1/num_loops (≈0.33), which the author identifies as over-regularizing.
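Both initialization choices in one small sketch (dim and loop count taken from the PR's parameters):

```python
num_loops, dim = 3, 640
# zero init: loop embeddings contribute nothing at step 0,
# so the first loop matches a vanilla forward pass
loop_embeddings = [[0.0] * dim for _ in range(num_loops)]
# uniform init: each gate starts at 1/num_loops ≈ 0.333
# (the author flags this as over-regularizing)
loop_gates = [1.0 / num_loops] * num_loops
```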

Novel Contributions

  • Depth recurrence via looping a small set of shared transformer layers to achieve greater effective depth.
  • Weight-shared looping transformer with 5 unique layers repeated 3 times.
  • Loop embeddings to differentiate repeated passes through shared weights.
  • Loop gates to mix loop outputs with the initial residual stream.
  • Exploration of a depth-width tradeoff by reallocating saved parameters to wider hidden dimension.
  • Negative finding that both conservative loop gating and removing skip connections hurt performance.