PR #855

open

Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)

by aazizyan
val_bpb
1.2659
Architecture
Transformer
Optimizer
Artifact Size
10.7 MB

Training Techniques

Architecture
depth recurrence
Uses 1 prelude + 4 shared blocks repeated for 3 loops + 1 coda, yielding 14 effective layers from 6 unique blocks.
parameters: {"prelude":1,"shared_blocks":4,"loops":3,"coda":1}
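A minimal sketch of the prelude / shared-loop / coda layout described by these parameters. The block functions are hypothetical placeholders standing in for real transformer blocks; the point is the weight reuse and the layer arithmetic.

```python
# Depth-recurrence layout: 1 prelude + 4 shared blocks x 3 loops + 1 coda.
PRELUDE, SHARED_BLOCKS, LOOPS, CODA = 1, 4, 3, 1

def forward(x, prelude, shared, coda, loops=LOOPS):
    """Apply 1 prelude block, the shared stack `loops` times, then 1 coda."""
    x = prelude(x)
    for _ in range(loops):          # the same 4 blocks are reused each loop
        for block in shared:
            x = block(x)
    return coda(x)

# Effective depth vs. unique parameters:
effective_layers = PRELUDE + SHARED_BLOCKS * LOOPS + CODA   # 1 + 12 + 1 = 14
unique_blocks = PRELUDE + SHARED_BLOCKS + CODA              # 1 + 4 + 1 = 6
```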
Output-LN
Moves RMSNorm from MLP input to MLP output so shared weights can distinguish loop iterations by magnitude.
parameters: null
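A sketch of the placement change, assuming a standard RMSNorm (learned gain omitted for brevity). With pre-LN, the MLP always sees a unit-scale input, so a shared block has no way to tell which loop iteration it is in; with Output-LN the raw residual magnitude reaches the MLP and only its output is normalized.

```python
import math

def rms_norm(v, eps=1e-6):
    """RMSNorm over a plain list of floats (gain omitted for brevity)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def mlp_pre_ln(v, mlp):
    # Conventional placement: the MLP input is normalized, so residual
    # magnitude carries no information across loop iterations.
    return mlp(rms_norm(v))

def mlp_output_ln(v, mlp):
    # Output-LN placement: the un-normalized residual enters the MLP, so
    # its magnitude distinguishes iterations; the norm moves to the output.
    return rms_norm(mlp(v))
```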
Birkhoff mixing
Replaces learned residual mixing with a sigmoid-constrained convex combination to keep spectral norm <= 1.
parameters: null
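A sketch of the constrained mixing, under the assumption that "Birkhoff" here refers to the two-input convex (doubly-stochastic) case: the residual stream and the block output are averaged with weights alpha and 1 - alpha, alpha = sigmoid(a), so both weights are nonnegative and sum to 1 and the mixing operator has spectral norm <= 1. The function names are illustrative, not the submission's API.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def birkhoff_mix(x, fx, a):
    """Convex combination of residual stream x and block output fx.

    alpha = sigmoid(a) lies in (0, 1), so the two mixing weights sum to 1:
    the mix is an averaging operator with spectral norm <= 1, unlike a
    freely learned residual scale, which can amplify across loops.
    """
    alpha = sigmoid(a)
    return [alpha * xi + (1.0 - alpha) * fi for xi, fi in zip(x, fx)]
```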
timestep scaling
Per-iteration learned scale vectors applied across loops, capped to a fixed range.
parameters: {"cap":4}
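A sketch of the per-loop scaling, assuming the cap of 4 bounds the scales symmetrically in [-4, 4] (the exact range is not stated in the card). One learned vector per loop iteration is applied channel-wise before the shared stack runs again.

```python
def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def timestep_scale(x, gammas, loop_idx, cap=4.0):
    """Apply the learned scale vector for this loop, capped to [-cap, cap].

    gammas[loop_idx] holds one scale per channel; capping bounds how far a
    scale can drift, which also keeps the values cheap to carry through
    quantization as float16 passthrough.
    """
    g = gammas[loop_idx]
    return [clamp(gi, -cap, cap) * xi for gi, xi in zip(g, x)]
```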
Regularization
layerwise LN scale
parameters: null
Quantization
int8
bits: 8
scope: model weights with float16 passthrough for timestep gammas
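A sketch of the quantization scope, assuming a standard symmetric per-tensor int8 scheme for the model weights; the timestep gammas would skip this path entirely and be stored as float16 (the "passthrough"). The exact scheme used by the artifact is not specified here.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization sketch."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

# Timestep gammas bypass quantize_int8 and stay float16, so the
# per-iteration scales are not distorted by the int8 rounding above.
```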
Other
other
LeakyReLU(0.5)^2 activation, used so that negative signal survives the quadratic nonlinearity instead of being zeroed out as in a plain squared ReLU.
parameters: {"negative_slope":0.5}
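A sketch of the activation: squaring a LeakyReLU with slope 0.5 gives x^2 on the positive side and (0.5x)^2 = 0.25x^2 on the negative side, so a quarter of the squared magnitude survives where ReLU^2 would output exactly 0.

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """Squared LeakyReLU: x**2 for x >= 0, (slope * x)**2 for x < 0.

    A plain ReLU**2 maps every negative input to 0; the leaky slope keeps
    0.25 * x**2 on the negative side, preserving negative signal through
    the quadratic activation.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```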

Novel Contributions

  • First viable 3-loop recurrence in the competition
  • Output-LN to prevent magnitude erasure across recurrent iterations
  • Birkhoff-constrained residual mixing to stabilize recurrence and limit spectral blowup
  • Capped timestep scaling with float16 passthrough to reduce quantization gap
  • Demonstration that these techniques can reduce catastrophic quantization amplification in recurrent depth models