val_bpb: 1.1147
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: < 16 MB
Training Techniques
- Quantization: GPTQ (bits: 6; scope: shared weights)
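As a rough sanity check on how these numbers fit together, a 16 MB artifact cap at 6 bits per weight bounds the quantized parameter count at about 22M (a sketch only; the real artifact also spends bytes on quantization scales, any unquantized tensors, and metadata):

```python
# Rough ceiling on quantized parameter count implied by the card's
# numbers. Ignores quantization scales, codebooks, and metadata,
# which also consume part of the 16 MB artifact.
budget_bits = 16 * 1024 * 1024 * 8   # 16 MB artifact cap, in bits
bits_per_weight = 6                  # GPTQ bit width from the card
max_weights = budget_bits // bits_per_weight
print(f"~{max_weights / 1e6:.1f}M weights")
```

At 6 bits per weight this comes to roughly 22.4M weights, which is consistent with the card's emphasis on weight sharing to stretch effective depth within a small parameter budget.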
Architecture
- depth recurrence: shares transformer block weights across multiple iterations to create more effective layers within the same parameter budget (parameters: layers=4, iterations=5)
- weight tying: uses weight-shared transformer blocks with tied parameters across the recurrent depth iterations
- U-Net skip connections: adapts U-Net-style skip connections to the recurrent transformer structure
- BigramHash: used as part of the model stack
- XSA: used as part of the model stack
- Partial RoPE: retained in the architecture stack
- VE128: retained in the architecture stack
- SmearGate: retained in the architecture stack
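A minimal sketch of how the depth recurrence and the U-Net-style skip wiring above can compose — 4 shared blocks unrolled for 5 iterations gives 20 effective layers. All names are hypothetical and the "blocks" are toy functions on lists of floats, not the submission's transformer layers:

```python
# Sketch: 4 weight-shared blocks unrolled for 5 iterations -> 20
# effective layers, with U-Net-style skips pairing early ("encoder")
# and late ("decoder") iterations. Dependency-free stand-in only.

def make_block(scale):
    # Stand-in for one weight-shared transformer block; `scale`
    # plays the role of that block's (shared) parameters.
    return lambda xs: [x + scale * 0.01 * x for x in xs]

LAYERS, ITERATIONS = 4, 5                            # from the card
blocks = [make_block(i + 1) for i in range(LAYERS)]  # created ONCE

def forward(h):
    saved = []                        # activations for skip connections
    for it in range(ITERATIONS):
        for block in blocks:          # same weights reused every pass
            h = block(h)
        if it < ITERATIONS // 2:
            saved.append(h)           # first half: stash activations
        elif saved:
            skip = saved.pop()        # second half: mirrored skip
            h = [a + b for a, b in zip(h, skip)]
    return h

out = forward([1.0, -1.0])
```

Note the parameter count is that of 4 blocks while the forward pass applies 20, which is the trade the card's depth-recurrence entry describes; where exactly the skips attach in the submission is not recorded here.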
Optimizer
- Parallel Muon (weight_decay, momentum, and other hyperparameters: not recorded)
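The card records no Parallel Muon hyperparameters, but the core of Muon as published is momentum SGD whose update matrix is orthogonalized by a Newton–Schulz iteration. The sketch below shows only that orthogonalization step, with the quintic coefficients from the public reference implementation — an assumption about the algorithm family, not this submission's settings, and it omits the "Parallel" sharding entirely:

```python
import math

# Newton-Schulz orthogonalization at the heart of Muon (public
# algorithm; coefficients from the reference implementation).
# Pure-Python matrices (lists of rows) keep the sketch dependency-free.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately maps G to U @ V^T from its SVD, i.e. pushes all
    # singular values toward 1, without computing the SVD explicitly.
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients
    norm = math.sqrt(sum(x * x for row in G for x in row)) + eps
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        B = [[b * x + c * y for x, y in zip(ra, rb)]
             for ra, rb in zip(A, matmul(A, A))]
        X = [[a * x + y for x, y in zip(rx, rbx)]
             for rx, rbx in zip(X, matmul(B, X))]
    return X

# A (non-parallel) Muon step would then be roughly:
#   buf = momentum * buf + grad
#   weight -= lr * newton_schulz(buf)
O = newton_schulz([[3.0, 0.0], [0.0, 1.0]])
```

After five steps the input's singular values (3 and 1) are pulled toward 1; the iteration lands them in a band around 1 rather than exactly on it, which is the known behavior of these coefficients.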
Novel Contributions
- Weight-shared depth recurrence to achieve 20+ effective layers within the 16 MB budget
- Per-layer conditioning with layer index embeddings and learned scalar gates
- Per-iteration RMSNorm for stabilizing deep recurrence
- Adapted U-Net skip connections for recurrent transformer structure
- Reallocation of parameter budget from unique layers to wider or more capable components
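The conditioning and normalization contributions above can be sketched as follows. All parameter values are placeholders for learned tensors, and the choice to index embeddings and gates per effective layer (layer × iteration) is an assumption — the card does not record the exact placement:

```python
import math

# Sketch of per-layer conditioning (layer-index embeddings plus
# learned scalar gates) and per-iteration RMSNorm for a weight-shared
# recurrent stack. Hypothetical, dependency-free stand-in.

LAYERS, ITERATIONS, DIM = 4, 5, 3

# One embedding vector and one scalar gate per effective-layer slot
# lets the shared weights behave differently at each depth.
layer_emb = [[0.01 * (i + 1)] * DIM for i in range(LAYERS * ITERATIONS)]
gates = [0.5] * (LAYERS * ITERATIONS)      # learned scalars in training

def rmsnorm(x, g, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g_i * v / rms for g_i, v in zip(g, x)]

def shared_block(x):
    # Stand-in for the single set of shared transformer weights.
    return [0.9 * v + 0.1 for v in x]

def forward(h):
    idx = 0
    for _ in range(ITERATIONS):
        for _ in range(LAYERS):
            cond = [v + e for v, e in zip(h, layer_emb[idx])]
            h = [v + gates[idx] * u
                 for v, u in zip(h, shared_block(cond))]
            idx += 1
        # Per-iteration RMSNorm keeps activations bounded as the
        # recurrence deepens, stabilizing the 20-layer effective stack.
        h = rmsnorm(h, g=[1.0] * DIM)
    return h

out = forward([1.0, 0.0, -1.0])
```

Because the last operation of each iteration is an RMSNorm with unit gain, the output's root-mean-square is pinned near 1 regardless of depth, which is the stabilization role the contribution list claims.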