PR #1278

open

WIP: Depth Recurrence via Weight-Shared Transformer Blocks

by GitGeeksView on GitHub
val_bpb
1.1147
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
< 16 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: shared weights
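A rough back-of-envelope check (not stated in the PR) of what the 6-bit GPTQ setting implies for the <16 MB artifact cap: at 6 bits per shared weight, and ignoring quantization scales/zeros and any unquantized tensors, the budget admits roughly 22M parameters.

```python
# Back-of-envelope only: parameter headroom implied by the PR's
# artifact cap and quantization bits. Real artifacts also carry
# GPTQ scales/zero-points and possibly unquantized tensors.
BUDGET_BYTES = 16 * 2**20   # 16 MiB artifact cap from the PR
BITS_PER_WEIGHT = 6         # GPTQ "bits: 6" from the config above

max_params = BUDGET_BYTES * 8 // BITS_PER_WEIGHT
print(f"{max_params / 1e6:.1f}M parameters")  # ≈ 22.4M
```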
Architecture
depth recurrence
Shares one set of transformer block weights across multiple recurrent iterations, yielding a deeper effective network (more effective layers) from the same parameter budget.
parameters: {"layers":4,"iterations":5}
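The parameters above (4 layers, 5 iterations) suggest the forward pass loops the same 4 blocks 5 times for 20 effective layers. A minimal sketch of that loop, with a toy linear-plus-residual stand-in for the PR's actual transformer block:

```python
# Sketch of depth recurrence: 4 unique blocks applied 5 times gives
# 20 effective layers from one set of weights. Block internals here
# are placeholders, not the PR's implementation.
import numpy as np

rng = np.random.default_rng(0)
LAYERS, ITERATIONS, D = 4, 5, 8   # d_model=8 is illustrative only

# One set of block weights, reused on every iteration (weight tying).
blocks = [rng.standard_normal((D, D)) * 0.1 for _ in range(LAYERS)]

def block_forward(x, w):
    # Stand-in for a transformer block: nonlinearity + residual.
    return x + np.tanh(x @ w)

def recurrent_forward(x):
    effective_layers = 0
    for _ in range(ITERATIONS):      # recur over depth
        for w in blocks:             # same weights on every pass
            x = block_forward(x, w)
            effective_layers += 1
    return x, effective_layers

y, depth = recurrent_forward(rng.standard_normal((1, D)))
print(depth)  # 20 effective layers from 4 unique blocks
```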
weight tying
Uses weight-shared transformer blocks with tied parameters across recurrent depth iterations.
parameters: null
U-Net skip connections
Adapts U-Net-style skip connections for the recurrent transformer structure.
parameters: null
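One common way to adapt U-Net skips to a recurrent stack (an assumption here; the PR does not spell out its wiring) is to save activations from the first half of the iterations and add them back, last-saved-first, during the second half:

```python
# Sketch of U-Net-style skips over depth recurrence: the "encoder"
# half saves activations, the "decoder" half adds them back in
# reverse order. Block internals and pairing are illustrative.
import numpy as np

rng = np.random.default_rng(0)
ITERATIONS, D = 4, 8                   # even count so halves pair up
w = rng.standard_normal((D, D)) * 0.1  # one shared block weight

def block(x):
    return x + np.tanh(x @ w)

def unet_recurrent_forward(x):
    skips = []
    for _ in range(ITERATIONS // 2):   # first half: save activations
        x = block(x)
        skips.append(x)
    for _ in range(ITERATIONS // 2):   # second half: add skips back
        x = x + skips.pop()            # i-th paired with (n-i)-th, U-Net style
        x = block(x)
    return x

y = unet_recurrent_forward(rng.standard_normal((1, D)))
print(y.shape)
```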
BigramHash
Uses BigramHash as part of the model stack.
parameters: null
XSA
Uses XSA as part of the model stack.
parameters: null
Partial RoPE
Retains Partial RoPE in the architecture stack.
parameters: null
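Partial RoPE conventionally means rotating only a fraction of each head's dimensions and passing the rest through unchanged. A sketch under that reading (the rotated fraction of 50% is an assumption; the PR does not state its value):

```python
# Sketch of partial RoPE: rotate the first `rot` dims of each head,
# leave the remainder untouched. The 50% rotated fraction is
# illustrative, not taken from the PR.
import numpy as np

def partial_rope(x, pos, rot_frac=0.5, base=10000.0):
    d = x.shape[-1]
    rot = int(d * rot_frac)                 # dims that get rotated
    x_rot, x_pass = x[..., :rot], x[..., rot:]
    half = rot // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs                       # rotation angle per dim pair
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)

q = np.ones((1, 8))                         # one query vector, head_dim=8
out = partial_rope(q, pos=3)
print(out.shape)                            # (1, 8); last 4 dims pass through
```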
VE128
Retains VE128 in the architecture stack.
parameters: null
SmearGate
Retains SmearGate in the architecture stack.
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null

Novel Contributions

  • Weight-shared depth recurrence to achieve 20+ effective layers within the 16MB budget
  • Per-layer conditioning with layer index embeddings and learned scalar gates
  • Per-iteration RMSNorm for stabilizing deep recurrence
  • Adapted U-Net skip connections for recurrent transformer structure
  • Reallocation of parameter budget from unique layers to wider or more capable components
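The conditioning and stabilization bullets above can be sketched together: an embedding indexed by iteration is added to the input, a learned scalar gate scales each block's residual, and an RMSNorm renormalizes after every iteration. All shapes, initializations, and the toy block are assumptions for illustration, not the PR's code.

```python
# Sketch of per-layer conditioning (iteration-index embeddings +
# scalar gates) and per-iteration RMSNorm around a weight-shared
# block. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
ITERATIONS, D = 5, 8

w = rng.standard_normal((D, D)) * 0.1                   # shared block weight
iter_emb = rng.standard_normal((ITERATIONS, D)) * 0.02  # layer-index embeddings
gates = np.full(ITERATIONS, 0.5)                        # learned scalar gates (init)
norm_w = np.ones(D)                                     # RMSNorm weight

def rmsnorm(x, weight, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def conditioned_forward(x):
    for i in range(ITERATIONS):
        h = x + iter_emb[i]          # condition on the iteration index
        h = np.tanh(h @ w)           # stand-in for the shared block
        x = x + gates[i] * h         # gated residual
        x = rmsnorm(x, norm_w)       # stabilize each recurrence step
    return x

y = conditioned_forward(rng.standard_normal((1, D)))
print(y.shape)
```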