PR #2137

open

Non-record: notes on the recurrence band (mixing parameters, MLP sizing, loop sizing)

by leon2k2k2k
val_bpb
1.0636
Architecture
Transformer
Optimizer
Artifact Size
16 MB

Training Techniques

Architecture
depth recurrence
Uses a recurrent loop band over layers 3-5 with repeated passes through the same block stack.
parameters: {"layers":[3,4,5],"NL":2}
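As a rough illustration of the loop band, the sketch below builds the order in which layers would be visited in one forward pass. It assumes an 11-layer stack (indices 0 to 10, as implied by the stage lists elsewhere in this summary) and assumes that NL counts the total number of passes through the band; the summary does not say whether the first pass is included, so that reading is a guess. The name `layer_schedule` is hypothetical and not taken from the PR.

```python
def layer_schedule(num_layers, band, num_passes):
    """Return the layer indices in the order they are applied.

    Assumption: `num_passes` is the total number of times the band is run,
    so num_passes=1 reproduces a plain, non-recurrent stack.
    """
    band = sorted(band)
    if band != list(range(band[0], band[-1] + 1)):
        raise ValueError("band must be a contiguous run of layer indices")
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    before = list(range(0, band[0]))               # layers ahead of the band
    after = list(range(band[-1] + 1, num_layers))  # layers behind the band
    return before + band * num_passes + after


schedule = layer_schedule(num_layers=11, band=[3, 4, 5], num_passes=2)
print(schedule)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because the band's weights are shared across passes, the parameter count stays that of 11 layers while the number of layer applications grows.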
depth recurrence
Learns and then freezes recurrent alpha-beta mixing coefficients for the loop band, including cross-layer carries.
parameters: {"layers":[3,4,5]}
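The summary does not give the exact mixing rule, so the following is only one plausible reading: each band layer receives `alpha` times the current hidden state plus a `beta`-weighted sum of the band layers' outputs from the previous pass (the "cross-layer carries"). The functions `mix` and `run_band`, the zero-initialised carries, and the coefficient layout are all assumptions made for illustration. Freezing then amounts to passing the converged values in as plain constants instead of trainable parameters.

```python
def mix(current, carries, alpha, betas):
    """Elementwise alpha * current + sum_j betas[j] * carries[j]."""
    out = [alpha * value for value in current]
    for beta, carry in zip(betas, carries):
        for i, value in enumerate(carry):
            out[i] += beta * value
    return out


def run_band(x, blocks, num_passes, alpha, beta):
    """Run the band `num_passes` times with fixed mixing coefficients.

    alpha[l]   : weight on the incoming state at band layer l
    beta[l][j] : weight on band layer j's output from the previous pass
    Both are plain floats here, i.e. already frozen (hypothetical layout).
    """
    carries = [[0.0] * len(x) for _ in blocks]  # no carry on the first pass
    for _ in range(num_passes):
        new_carries = []
        for l, block in enumerate(blocks):
            x = block(mix(x, carries, alpha[l], beta[l]))
            new_carries.append(x)
        carries = new_carries
    return x


# Toy usage: identity blocks, alpha = 1 and beta = 0 leave the state unchanged.
identity = lambda h: h
zeros = [[0.0] * 3 for _ in range(3)]
print(run_band([1.0, 2.0], [identity] * 3, 2, [1.0] * 3, zeros))
```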
depth recurrence
Replaces the per-batch Anderson acceleration solve with frozen coefficients, giving a fixed-coefficient extrapolation over the last three iterates.
parameters: {"m":3}
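In standard Anderson mixing, the next iterate is an affine combination of recent fixed-point map outputs, with weights that sum to one and are normally obtained from a least-squares problem on the residuals. A frozen variant can be sketched as the same combination with constant weights. Which quantities the PR actually combines is not stated, and the coefficient values below are placeholders, not the PR's.

```python
def anderson_fixed_step(g_history, coeffs):
    """Combine the most recent len(coeffs) map outputs with fixed weights.

    g_history : list of vectors g(x_i), oldest first
    coeffs    : fixed weights, oldest first; must sum to 1 so that a
                converged sequence (all entries equal) is left unchanged
    """
    m = len(coeffs)
    if len(g_history) < m:
        raise ValueError("need at least m iterates in the history")
    if abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    recent = g_history[-m:]
    dim = len(recent[0])
    return [sum(c * g[i] for c, g in zip(coeffs, recent)) for i in range(dim)]


# Placeholder weights for m = 3 (illustrative only).
history = [[1.0], [2.0], [4.0]]
print(anderson_fixed_step(history, [0.25, 0.25, 0.5]))  # [2.75]
```

With the weights fixed, no least-squares solve is needed per batch; the cost is a single weighted sum.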
MLP3x
Reallocates FFN width across early, middle, and late stages to widen the loop band while keeping total parameters fixed.
parameters: {"early_layers":[0,1,2],"middle_layers":[3,4,5],"late_layers":[6,7,8,9,10]}
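The fixed-budget constraint can be made concrete with a little arithmetic. The sketch assumes FFN parameters scale linearly with the hidden-width multiplier at fixed model width, a uniform 3x baseline, and a single shared multiplier for the early and late stages; the PR may have used different per-stage values. The function name and the example multiplier of 5.0 are hypothetical.

```python
def rebalance(uniform_mult, n_early, n_middle, n_late, middle_mult):
    """Multiplier for the non-band layers that keeps total FFN width fixed.

    Total budget is measured in "multiplier units":
        uniform_mult * (n_early + n_middle + n_late)
    """
    total = uniform_mult * (n_early + n_middle + n_late)
    remaining = total - n_middle * middle_mult
    if remaining <= 0:
        raise ValueError("middle_mult uses up the whole FFN budget")
    return remaining / (n_early + n_late)


# 3 early, 3 middle (loop band), 5 late layers; widen the band to 5x.
other = rebalance(uniform_mult=3.0, n_early=3, n_middle=3, n_late=5, middle_mult=5.0)
print(other)  # 2.25, since (33 - 15) / 8 = 2.25
```

Widening the three band layers from 3x to 5x under this accounting would shrink the other eight layers to 2.25x each.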
depth recurrence
Varies the loop band set and number of loop visits per forward pass to study band sizing.
parameters: {"band_set":[3,4,5],"NL":[1,2,3,4]}
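One way to compare the swept settings is by the number of layer applications per forward pass, which serves as a rough proxy for compute. The sketch again assumes 11 layers and that NL counts total passes through the band; `effective_depth` is a hypothetical helper.

```python
def effective_depth(num_layers, band, num_passes):
    """Layer applications per forward pass when `band` is run `num_passes` times."""
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    return num_layers + len(band) * (num_passes - 1)


depths = {nl: effective_depth(11, [3, 4, 5], nl) for nl in (1, 2, 3, 4)}
print(depths)  # {1: 11, 2: 14, 3: 17, 4: 20}
```

Under this reading each extra pass through a three-layer band adds three layer applications without adding parameters.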
Weight Averaging
EMA
parameters: null

Novel Contributions

  • Learned recurrent alpha-beta mixing coefficients in the loop band, then froze the converged values and reused them in a fresh run.
  • Applied frozen Anderson acceleration coefficients to remove per-batch least-squares overhead.
  • Explored FFN width reallocation across early, middle, and late stages while keeping total parameters fixed.
  • Screened loop-band size and repetition count, showing that the canonical {3,4,5}, NL=2 configuration is locally optimal within the tested range.
  • Observed that the loop band is not obviously starved for FFN capacity at the uniform baseline.