PR #855
Status: open
Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)
by aazizyan
val_bpb: 1.2659
Architecture: Transformer
Optimizer: —
Artifact Size: 10.7 MB
Training Techniques
Architecture
depth recurrence
Uses 1 prelude + 4 shared blocks repeated for 3 loops + 1 coda, yielding 14 effective layers from 6 unique blocks.
parameters: {"prelude":1,"shared_blocks":4,"loops":3,"coda":1}
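The layer schedule above can be sketched directly from the stated parameters. This is a hypothetical illustration (the block names are invented for clarity, not taken from the PR's code):

```python
# Sketch of the depth-recurrent layer schedule:
# 1 prelude + (4 shared blocks x 3 loops) + 1 coda = 14 effective layers,
# built from only 1 + 4 + 1 = 6 unique blocks.

def layer_schedule(prelude=1, shared_blocks=4, loops=3, coda=1):
    """Return the sequence of unique-block ids executed in order."""
    schedule = [f"prelude{i}" for i in range(prelude)]
    for _ in range(loops):
        # The same shared-block weights are reused on every loop.
        schedule += [f"shared{i}" for i in range(shared_blocks)]
    schedule += [f"coda{i}" for i in range(coda)]
    return schedule

sched = layer_schedule()
print(len(sched))       # 14 effective layers
print(len(set(sched)))  # 6 unique blocks
```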
Output-LN
Moves RMSNorm from the MLP input to the MLP output so that the shared weights can distinguish loop iterations by residual-stream magnitude.
parameters: null
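A minimal numpy sketch of why pre-LN erases the magnitude signal that Output-LN preserves (the block structure here is an assumed simplification, not the PR's actual module):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def mlp_pre_ln(x, w1, w2):
    # Standard pre-norm block: x + MLP(RMSNorm(x)).
    return x + np.maximum(rms_norm(x) @ w1, 0.0) @ w2

def mlp_output_ln(x, w1, w2):
    # This PR's variant: x + RMSNorm(MLP(x)) -- the MLP sees the raw
    # residual stream, so its magnitude can encode the loop index.
    return x + rms_norm(np.maximum(x @ w1, 0.0) @ w2)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
# Pre-LN: doubling the input leaves the MLP's view unchanged, because
# rms_norm(2 * x) == rms_norm(x) (up to eps). With shared weights across
# loops, every iteration would look identical to the MLP.
print(np.allclose(rms_norm(2 * x), rms_norm(x)))  # True
```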
Birkhoff mixing
Replaces learned residual mixing with a sigmoid-constrained convex combination to keep spectral norm <= 1.
parameters: null
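One plausible reading of the constraint, sketched below: the residual and the block output are mixed with a sigmoid-gated convex combination, so the mixing step itself cannot grow the stream norm across loops. The gate shape is an assumption; the PR only states the sigmoid-constrained convex combination:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def birkhoff_mix(x, f_x, alpha):
    # Instead of y = a*x + b*f_x with free learned a, b (which can amplify
    # the residual stream on every loop), use a convex combination:
    # y = g*x + (1-g)*f_x with g = sigmoid(alpha) in (0, 1).
    # By convexity of the norm, ||y|| <= max(||x||, ||f_x||), i.e. the
    # mixing operator has spectral norm <= 1.
    g = sigmoid(alpha)
    return g * x + (1.0 - g) * f_x
```

The name presumably nods to the Birkhoff polytope of doubly stochastic matrices, of which a two-way convex combination is the simplest case.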
timestep scaling
Per-iteration learned scale vectors applied across loops, capped to a fixed range.
parameters: {"cap":4}
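A hedged sketch of the capped scaling, assuming one learned per-channel scale vector per loop iteration and the stated cap of 4 (the symmetric clamp range is an assumption):

```python
import numpy as np

CAP = 4.0  # parameters say {"cap": 4}

def timestep_scale(x, gamma_t):
    # gamma_t: learned per-channel scale vector for loop iteration t.
    # Clamping to a fixed range keeps the scales from compounding into
    # a blowup across the 3 loops.
    gamma_t = np.clip(gamma_t, -CAP, CAP)
    return gamma_t * x

print(timestep_scale(np.ones(3), np.array([10.0, -10.0, 2.0])))
# -> [ 4. -4.  2.]
```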
Regularization
layerwise LN scale
parameters: null
Quantization
int8
bits: 8
scope: model weights with float16 passthrough for timestep gammas
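The weight quantization can be sketched as standard symmetric per-tensor int8; the per-tensor scheme is an assumption, and the point of the float16 passthrough is that the small timestep gammas skip this step entirely:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: map [-max|w|, max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Timestep gammas are NOT quantized -- they are kept in float16, since a
# small error in a per-iteration scale is amplified on every loop.
def passthrough_fp16(gamma):
    return gamma.astype(np.float16)
```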
Other
other
LeakyReLU(0.5)^2 activation used to preserve negative signal through quadratic activation.
parameters: {"negative_slope":0.5}
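A literal reading of the activation, sketched below: plain ReLU^2 zeroes every negative pre-activation, while the leaky slope of 0.5 keeps a quarter-strength negative-side signal alive after squaring:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU(0.5) followed by squaring: a negative input -a maps to
    # (0.5 * a)^2 = a^2 / 4, instead of 0 under ReLU^2.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

print(leaky_relu_sq(np.array([-2.0, 2.0])))  # -> [1. 4.]
```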
Novel Contributions
- First viable 3-loop recurrence in the competition
- Output-LN to prevent magnitude erasure across recurrent iterations
- Birkhoff-constrained residual mixing to stabilize recurrence and limit spectral blowup
- Capped timestep scaling with float16 passthrough to reduce quantization gap
- Demonstration that these techniques can reduce catastrophic quantization amplification in recurrent depth models