val_bpb: 1.1147
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: < 16 MB
Training Techniques
- Quantization: GPTQ (bits: 6; scope: shared weights)
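As a rough sanity check on how these numbers fit together, a 16 MB artifact cap at 6 bits per weight bounds the quantized parameter count at about 22M (a sketch only; the real artifact also spends bytes on quantization scales, any unquantized tensors, and metadata):

```python
# Rough ceiling on quantized parameter count implied by the card's
# numbers. Ignores quantization scales, codebooks, and metadata,
# which also consume part of the 16 MB artifact.
budget_bits = 16 * 1024 * 1024 * 8   # 16 MB artifact cap, in bits
bits_per_weight = 6                  # GPTQ bit width from the card
max_weights = budget_bits // bits_per_weight
print(f"~{max_weights / 1e6:.1f}M weights")
```

At 6 bits per weight this comes to roughly 22.4M weights, which is consistent with the card's emphasis on weight sharing to stretch effective depth within a small parameter budget.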
Architecture
- depth recurrence: shares transformer block weights across multiple iterations to create more effective layers within the same parameter budget (parameters: layers=4, iterations=5)
- weight tying: uses weight-shared transformer blocks with tied parameters across the recurrent depth iterations
- U-Net skip connections: adapts U-Net-style skip connections to the recurrent transformer structure
- BigramHash: used as part of the model stack
- XSA: used as part of the model stack
- Partial RoPE: retained in the architecture stack
- VE128: retained in the architecture stack
- SmearGate: retained in the architecture stack
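A minimal sketch of how the depth recurrence and the U-Net-style skip wiring above can compose — 4 shared blocks unrolled for 5 iterations gives 20 effective layers. All names are hypothetical and the "blocks" are toy functions on lists of floats, not the submission's transformer layers:

```python
# Sketch: 4 weight-shared blocks unrolled for 5 iterations -> 20
# effective layers, with U-Net-style skips pairing early ("encoder")
# and late ("decoder") iterations. Dependency-free stand-in only.

def make_block(scale):
    # Stand-in for one weight-shared transformer block; `scale`
    # plays the role of that block's (shared) parameters.
    return lambda xs: [x + scale * 0.01 * x for x in xs]

LAYERS, ITERATIONS = 4, 5                            # from the card
blocks = [make_block(i + 1) for i in range(LAYERS)]  # created ONCE

def forward(h):
    saved = []                        # activations for skip connections
    for it in range(ITERATIONS):
        for block in blocks:          # same weights reused every pass
            h = block(h)
        if it < ITERATIONS // 2:
            saved.append(h)           # first half: stash activations
        elif saved:
            skip = saved.pop()        # second half: mirrored skip
            h = [a + b for a, b in zip(h, skip)]
    return h

out = forward([1.0, -1.0])
```

Note the parameter count is that of 4 blocks while the forward pass applies 20, which is the trade the card's depth-recurrence entry describes; where exactly the skips attach in the submission is not recorded here.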
Optimizer
- Parallel Muon (weight_decay, momentum, and other hyperparameters: not recorded)
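The card records no Parallel Muon hyperparameters, but the core of Muon as published is momentum SGD whose update matrix is orthogonalized by a Newton–Schulz iteration. The sketch below shows only that orthogonalization step, with the quintic coefficients from the public reference implementation — an assumption about the algorithm family, not this submission's settings, and it omits the "Parallel" sharding entirely:

```python
import math

# Newton-Schulz orthogonalization at the heart of Muon (public
# algorithm; coefficients from the reference implementation).
# Pure-Python matrices (lists of rows) keep the sketch dependency-free.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately maps G to U @ V^T from its SVD, i.e. pushes all
    # singular values toward 1, without computing the SVD explicitly.
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients
    norm = math.sqrt(sum(x * x for row in G for x in row)) + eps
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        B = [[b * x + c * y for x, y in zip(ra, rb)]
             for ra, rb in zip(A, matmul(A, A))]
        X = [[a * x + y for x, y in zip(rx, rbx)]
             for rx, rbx in zip(X, matmul(B, X))]
    return X

# A (non-parallel) Muon step would then be roughly:
#   buf = momentum * buf + grad
#   weight -= lr * newton_schulz(buf)
O = newton_schulz([[3.0, 0.0], [0.0, 1.0]])
```

After five steps the input's singular values (3 and 1) are pulled toward 1; the iteration lands them in a band around 1 rather than exactly on it, which is the known behavior of these coefficients.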
Novel Contributions
- Weight-shared depth recurrence to achieve 20+ effective layers within the 16 MB budget
- Per-layer conditioning with layer index embeddings and learned scalar gates
- Per-iteration RMSNorm for stabilizing deep recurrence
- Adapted U-Net skip connections for recurrent transformer structure
- Reallocation of parameter budget from unique layers to wider or more capable components
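The conditioning and normalization contributions above can be sketched as follows. All parameter values are placeholders for learned tensors, and the choice to index embeddings and gates per effective layer (layer × iteration) is an assumption — the card does not record the exact placement:

```python
import math

# Sketch of per-layer conditioning (layer-index embeddings plus
# learned scalar gates) and per-iteration RMSNorm for a weight-shared
# recurrent stack. Hypothetical, dependency-free stand-in.

LAYERS, ITERATIONS, DIM = 4, 5, 3

# One embedding vector and one scalar gate per effective-layer slot
# lets the shared weights behave differently at each depth.
layer_emb = [[0.01 * (i + 1)] * DIM for i in range(LAYERS * ITERATIONS)]
gates = [0.5] * (LAYERS * ITERATIONS)      # learned scalars in training

def rmsnorm(x, g, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g_i * v / rms for g_i, v in zip(g, x)]

def shared_block(x):
    # Stand-in for the single set of shared transformer weights.
    return [0.9 * v + 0.1 for v in x]

def forward(h):
    idx = 0
    for _ in range(ITERATIONS):
        for _ in range(LAYERS):
            cond = [v + e for v, e in zip(h, layer_emb[idx])]
            h = [v + gates[idx] * u
                 for v, u in zip(h, shared_block(cond))]
            idx += 1
        # Per-iteration RMSNorm keeps activations bounded as the
        # recurrence deepens, stabilizing the 20-layer effective stack.
        h = rmsnorm(h, g=[1.0] * DIM)
    return h

out = forward([1.0, 0.0, -1.0])
```

Because the last operation of each iteration is an RMSNorm with unit gain, the output's root-mean-square is pinned near 1 regardless of depth, which is the stabilization role the contribution list claims.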