PR #855
Status: open
Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)
by aazizyan
val_bpb: 1.2659
Architecture: Transformer
Optimizer: —
Artifact Size: 10.7 MB
Training Techniques
Architecture
depth recurrence
Uses 1 prelude + 4 shared blocks repeated for 3 loops + 1 coda, yielding 14 effective layers from 6 unique blocks.
parameters: {"prelude":1,"shared_blocks":4,"loops":3,"coda":1}
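The layer schedule above can be sketched directly from the stated parameters. This is a hypothetical illustration (the block names are invented for clarity, not taken from the PR's code):

```python
# Sketch of the depth-recurrent layer schedule:
# 1 prelude + (4 shared blocks x 3 loops) + 1 coda = 14 effective layers,
# built from only 1 + 4 + 1 = 6 unique blocks.

def layer_schedule(prelude=1, shared_blocks=4, loops=3, coda=1):
    """Return the sequence of unique-block ids executed in order."""
    schedule = [f"prelude{i}" for i in range(prelude)]
    for _ in range(loops):
        # The same shared-block weights are reused on every loop.
        schedule += [f"shared{i}" for i in range(shared_blocks)]
    schedule += [f"coda{i}" for i in range(coda)]
    return schedule

sched = layer_schedule()
print(len(sched))       # 14 effective layers
print(len(set(sched)))  # 6 unique blocks
```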
Output-LN
Moves RMSNorm from the MLP input to the MLP output so that the shared weights can distinguish loop iterations by residual-stream magnitude.
parameters: null
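A minimal numpy sketch of why pre-LN erases the magnitude signal that Output-LN preserves (the block structure here is an assumed simplification, not the PR's actual module):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def mlp_pre_ln(x, w1, w2):
    # Standard pre-norm block: x + MLP(RMSNorm(x)).
    return x + np.maximum(rms_norm(x) @ w1, 0.0) @ w2

def mlp_output_ln(x, w1, w2):
    # This PR's variant: x + RMSNorm(MLP(x)) -- the MLP sees the raw
    # residual stream, so its magnitude can encode the loop index.
    return x + rms_norm(np.maximum(x @ w1, 0.0) @ w2)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
# Pre-LN: doubling the input leaves the MLP's view unchanged, because
# rms_norm(2 * x) == rms_norm(x) (up to eps). With shared weights across
# loops, every iteration would look identical to the MLP.
print(np.allclose(rms_norm(2 * x), rms_norm(x)))  # True
```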
Birkhoff mixing
Replaces learned residual mixing with a sigmoid-constrained convex combination to keep spectral norm <= 1.
parameters: null
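One plausible reading of the constraint, sketched below: the residual and the block output are mixed with a sigmoid-gated convex combination, so the mixing step itself cannot grow the stream norm across loops. The gate shape is an assumption; the PR only states the sigmoid-constrained convex combination:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def birkhoff_mix(x, f_x, alpha):
    # Instead of y = a*x + b*f_x with free learned a, b (which can amplify
    # the residual stream on every loop), use a convex combination:
    # y = g*x + (1-g)*f_x with g = sigmoid(alpha) in (0, 1).
    # By convexity of the norm, ||y|| <= max(||x||, ||f_x||), i.e. the
    # mixing operator has spectral norm <= 1.
    g = sigmoid(alpha)
    return g * x + (1.0 - g) * f_x
```

The name presumably nods to the Birkhoff polytope of doubly stochastic matrices, of which a two-way convex combination is the simplest case.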
timestep scaling
Per-iteration learned scale vectors applied across loops, capped to a fixed range.
parameters: {"cap":4}
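A hedged sketch of the capped scaling, assuming one learned per-channel scale vector per loop iteration and the stated cap of 4 (the symmetric clamp range is an assumption):

```python
import numpy as np

CAP = 4.0  # parameters say {"cap": 4}

def timestep_scale(x, gamma_t):
    # gamma_t: learned per-channel scale vector for loop iteration t.
    # Clamping to a fixed range keeps the scales from compounding into
    # a blowup across the 3 loops.
    gamma_t = np.clip(gamma_t, -CAP, CAP)
    return gamma_t * x

print(timestep_scale(np.ones(3), np.array([10.0, -10.0, 2.0])))
# -> [ 4. -4.  2.]
```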
Regularization
layerwise LN scale
parameters: null
Quantization
int8
bits: 8
scope: model weights with float16 passthrough for timestep gammas
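The weight quantization can be sketched as standard symmetric per-tensor int8; the per-tensor scheme is an assumption, and the point of the float16 passthrough is that the small timestep gammas skip this step entirely:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: map [-max|w|, max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Timestep gammas are NOT quantized -- they are kept in float16, since a
# small error in a per-iteration scale is amplified on every loop.
def passthrough_fp16(gamma):
    return gamma.astype(np.float16)
```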
Other
other
LeakyReLU(0.5)^2 activation used to preserve negative signal through quadratic activation.
parameters: {"negative_slope":0.5}
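A literal reading of the activation, sketched below: plain ReLU^2 zeroes every negative pre-activation, while the leaky slope of 0.5 keeps a quarter-strength negative-side signal alive after squaring:

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    # LeakyReLU(0.5) followed by squaring: a negative input -a maps to
    # (0.5 * a)^2 = a^2 / 4, instead of 0 under ReLU^2.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

print(leaky_relu_sq(np.array([-2.0, 2.0])))  # -> [1. 4.]
```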
Novel Contributions
- First viable 3-loop recurrence in the competition
- Output-LN to prevent magnitude erasure across recurrent iterations
- Birkhoff-constrained residual mixing to stabilize recurrence and limit spectral blowup
- Capped timestep scaling with float16 passthrough to reduce quantization gap
- Demonstration that these techniques can reduce catastrophic quantization amplification in recurrent depth models