PR #1230

closed

Stable Growing Recurrence: Progressive Depth + Error Feedback (non-record)

by nestamidavaine
val_bpb
1.1163
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.96 MB

Training Techniques

Architecture
depth recurrence
Shared recurrent core with progressive pass growth during training and evaluation.
parameters: {"stem_layers":4,"core_layers":3,"tail_layers":4,"passes":3}
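As a rough illustration of how a stem, a shared core, and a tail could be sequenced, the sketch below unrolls the layer order for a given pass count. The function name and the (block, index) labels are hypothetical; only the layer counts come from the parameters above, and the assumption is that the core runs its passes back to back between the stem and the tail.

```python
def unrolled_layer_order(stem_layers=4, core_layers=3, tail_layers=4, passes=3):
    """Return the (block, index) pairs executed in one forward pass.

    Assumption: the core block is applied `passes` times in a row, reusing
    the same weights, between a non-shared stem and a non-shared tail.
    """
    order = [("stem", i) for i in range(stem_layers)]
    for _ in range(passes):
        # Same core indices on every pass: weights are shared, compute is not.
        order += [("core", i) for i in range(core_layers)]
    order += [("tail", i) for i in range(tail_layers)]
    return order


order = unrolled_layer_order()
unique_layers = len(set(order))   # layers that own parameters
effective_depth = len(order)      # layer applications per forward pass
print(unique_layers, effective_depth)  # 11 17
```

Under these assumptions, 11 parameterized layers yield 17 layer applications at 3 passes.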
LeakyReLU
Uses LeakyReLU(0.5)^2 activation in the MLP.
parameters: {"slope":0.5}
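A minimal scalar sketch of the activation, reading LeakyReLU(0.5)^2 as "apply LeakyReLU with negative slope 0.5, then square". The function name is hypothetical, and a real model would apply this elementwise to tensors.

```python
def leaky_relu_squared(x, slope=0.5):
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0.0 else slope * x
    return y * y


print([leaky_relu_squared(v) for v in (-2.0, 0.0, 3.0)])  # [1.0, 0.0, 9.0]
```

Note that the output is non-negative on both sides of zero; negative inputs are attenuated by slope^2 = 0.25 relative to a plain square.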
BigramHash
Bigram hash embedding for token representation.
parameters: {"vocab_size":512}
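The sketch below shows one way a bigram hash could map each (previous token, current token) pair to one of 512 embedding rows. The hash function, the multiplier, and the placeholder bucket for the first position are assumptions for illustration; the source only gives the table size.

```python
def bigram_bucket(prev_token, token, num_buckets=512, multiplier=1000003):
    """Hash a token pair into a bucket index (hypothetical hash function)."""
    return (prev_token * multiplier + token) % num_buckets


def bigram_ids(tokens, num_buckets=512):
    """Return one bucket id per position, usable as an embedding lookup index."""
    if not tokens:
        return []
    ids = [0]  # assumption: position 0 has no predecessor and uses bucket 0
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur, num_buckets))
    return ids


ids = bigram_ids([17, 4096, 17, 4096, 9])
```

Distinct bigrams can collide in the same bucket; that is the trade-off for keeping the table at 512 rows.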
XSA
Cross/self-attention style skip mechanism applied to the last layers.
parameters: {"last_n_layers":4}
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16}
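A small sketch of partial RoPE on a single head vector: only the first 16 dimensions are rotated and the rest pass through unchanged. Which dimensions are rotated, the half-split pairing, and the base of 10000 are assumptions; the source only states that 16 dimensions receive rotary embeddings.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest as-is."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]  # assumed pairing: (i, i + half)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


head = [float(i + 1) for i in range(64)]  # hypothetical 64-dim head
rotated = partial_rope(head, position=5)
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and position 0 leaves the vector unchanged.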
VE128
Value residual enhancement on selected layers.
parameters: {"layers":[9,10]}
error feedback
Low-rank correction branch injected before recurrent passes to compensate accumulated quantization error.
parameters: {"rank":2}
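The following is a minimal sketch of a rank-2 additive correction of the form h + U(Vh), which is one common shape for a low-rank branch. What the branch actually takes as input (the hidden state or an explicit error estimate) is not stated in the source, so treating the hidden state as the input is an assumption, as are all names here.

```python
def low_rank_correction(h, down, up):
    """Return h + up @ (down @ h) using plain lists.

    down: rank x dim projection, up: dim x rank projection.
    """
    z = [sum(w * x for w, x in zip(row, h)) for row in down]  # rank-sized code
    return [x + sum(u * zk for u, zk in zip(up_row, z)) for x, up_row in zip(h, up)]


dim, rank = 4, 2
h = [1.0, 2.0, 3.0, 4.0]
down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
up_zero = [[0.0] * rank for _ in range(dim)]   # zero init: branch is a no-op
up_demo = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]

unchanged = low_rank_correction(h, down, up_zero)
corrected = low_rank_correction(h, down, up_demo)
extra_params = 2 * dim * rank  # cost of the branch in parameters
```

At rank 2 the branch adds only 2 * dim * rank parameters, which keeps its footprint in the exported artifact small.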
Quantization
late QAT
bits: 6
scope: core weights
GPTQ-lite
bits: 6
scope: export
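To make the 6-bit setting concrete, here is a sketch of symmetric round-to-nearest quantization with one abs-max scale per row. This is not GPTQ and is not claimed to be the PR's export path; the signed range of [-31, 31] and the per-row scale are assumptions.

```python
def quantize_row(row, bits=6):
    """Return (integer codes, scale) for a symmetric signed quantizer."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    amax = max(abs(w) for w in row)
    scale = amax / qmax if amax > 0.0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


weights = [0.62, -0.31, 0.05, -0.9, 0.0]
codes, scale = quantize_row(weights)
restored = dequantize_row(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

With round-to-nearest, the per-weight error is bounded by half a quantization step, that is, scale / 2. During quantization-aware training the forward pass would typically use the restored values while gradients flow to the original weights.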
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
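The scale formula can be evaluated directly. The sketch assumes a zero-based layer index; how the index is counted across repeated recurrent passes is not stated in the source.

```python
import math


def ln_scale(layer_index):
    """Scale factor 1/sqrt(layer + 1) for a zero-based layer index."""
    return 1.0 / math.sqrt(layer_index + 1)


print([round(ln_scale(i), 3) for i in range(4)])  # [1.0, 0.707, 0.577, 0.5]
```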
Jacobian proxy loss
parameters: {"lambda":0.01}
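The source does not define the proxy, so the sketch below is only one plausible form: a finite-difference estimate of how much the core amplifies a small perturbation, penalized when the gain exceeds 1. The function names, the perturbation, and the squared hinge are all assumptions; only lambda = 0.01 comes from the parameters above.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def jacobian_proxy_penalty(core_fn, h, delta, lam=0.01):
    """Penalize directional gain ||f(h + delta) - f(h)|| / ||delta|| above 1."""
    base = core_fn(h)
    shifted = core_fn([x + d for x, d in zip(h, delta)])
    gain = l2_norm([a - b for a, b in zip(shifted, base)]) / l2_norm(delta)
    return lam * max(0.0, gain - 1.0) ** 2


h = [1.0, -2.0, 0.5]
delta = [1e-3, 1e-3, -1e-3]
contractive = jacobian_proxy_penalty(lambda v: [0.5 * x for x in v], h, delta)
expansive = jacobian_proxy_penalty(lambda v: [3.0 * x for x in v], h, delta)
```

A single perturbation direction costs two core evaluations, compared with forming the full Jacobian, whose size is the square of the hidden dimension.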
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
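A sketch of the two averaging rules with the stated hyperparameters. How the PR combines EMA and SWA into the final weights is not described, so the sketch keeps them as two independent averages; the class and function names are hypothetical.

```python
def ema_update(ema, weights, decay=0.997):
    """One exponential-moving-average step over a flat weight list."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SwaAverager:
    """Running mean of weight snapshots taken every `swa_every` steps."""

    def __init__(self, swa_every=50):
        self.swa_every = swa_every
        self.count = 0
        self.mean = None

    def maybe_update(self, step, weights):
        if step % self.swa_every != 0:
            return
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count for m, w in zip(self.mean, weights)]


swa = SwaAverager()
ema = [0.0]
for step in range(1, 101):
    weights = [float(step)]  # stand-in for real model weights
    ema = ema_update(ema, weights)
    swa.maybe_update(step, weights)
```

In this toy run the SWA mean covers the snapshots at steps 50 and 100, while the EMA lags well behind the latest weights because a decay of 0.997 corresponds to a horizon of roughly 1 / (1 - 0.997), about 333 steps.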
Compression
lzma
level: null
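For reference, a round trip through Python's standard `lzma` module looks like the sketch below. It assumes that "level: null" means no explicit preset was recorded, so the library default is used; the payload is a stand-in, not the real artifact.

```python
import lzma

# Stand-in payload; a real export would be the serialized quantized weights.
payload = bytes(range(64)) * 1024

packed = lzma.compress(payload)      # default preset, .xz container
restored = lzma.decompress(packed)

print(len(payload), len(packed))     # raw size vs. compressed size in bytes
```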
Evaluation
sliding window eval
parameters: {"stride":64}
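The sketch below generates evaluation windows so that every token is scored exactly once while seeing as much left context as the window allows. The window length of 1024 is a hypothetical value; only the stride of 64 comes from the parameters above.

```python
def sliding_windows(num_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans.

    The model reads tokens[start:end] but only tokens[score_from:end]
    contribute to the reported loss.
    """
    spans = []
    scored_until = 0
    while scored_until < num_tokens:
        if scored_until == 0:
            end = min(window, num_tokens)      # first window scores everything
        else:
            end = min(scored_until + stride, num_tokens)
        start = max(0, end - window)
        spans.append((start, end, scored_until))
        scored_until = end
    return spans


spans = sliding_windows(2000)
```

After the first window, each step scores 64 new tokens with up to 960 tokens of context, at the cost of roughly window / stride times more compute than non-overlapping evaluation.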
Test-Time Training
score-first TTT
parameters: {"chunk_tokens":32768,"epochs":3,"learning_rate":0.002,"optimizer":"SGD + momentum","momentum":0.9}
LR Schedule
cosine decay
parameters: {"used_for":"TTT"}
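The score-first ordering can be sketched as a loop in which each chunk is scored with the current weights before the model is allowed to train on it. `score_fn` and `train_fn` are hypothetical callbacks, and stepping the cosine schedule once per chunk is an assumption; the chunk size, epoch count, and base learning rate come from the parameters above.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at the last step."""
    if total_steps <= 1:
        return base_lr
    progress = step / (total_steps - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def score_first_ttt(tokens, score_fn, train_fn, chunk_tokens=32768, epochs=3):
    """Score each chunk first, then adapt on it; return the summed score."""
    chunks = [tokens[i:i + chunk_tokens] for i in range(0, len(tokens), chunk_tokens)]
    total = 0.0
    for idx, chunk in enumerate(chunks):
        total += score_fn(chunk)           # loss is recorded before any update
        lr = cosine_lr(idx, len(chunks))
        for _ in range(epochs):
            train_fn(chunk, lr)            # e.g. SGD with momentum 0.9
    return total


events = []
total = score_first_ttt(
    list(range(10)),
    score_fn=lambda c: events.append(("score", c[0])) or 1.0,
    train_fn=lambda c, lr: events.append(("train", c[0])),
    chunk_tokens=4,
)
```

Because a chunk's score is fixed before the weights see that chunk, the reported loss never benefits from training on the tokens being evaluated.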

Novel Contributions

  • Progressive recurrence depth growth during training (1 pass to 2 passes to 3 passes).
  • Learnable residual scaling to make recurrent dynamics contractive.
  • Low-rank error feedback branch to compensate quantization error across recurrent passes.
  • Jacobian proxy loss to stabilize hidden-state growth without full Jacobian computation.
  • Warmup precompilation of all pass/QAT graph variants to avoid compile stalls.
  • Legal score-first TTT adapted to the recurrent architecture.
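The first and fifth contributions can be sketched together: a schedule that grows the pass count from 1 to 3 over training, and an enumeration of the (passes, QAT on/off) graph variants that a warmup phase would need to compile ahead of time. The boundary fractions and function names are hypothetical; the source gives only the 1, 2, 3 progression.

```python
from itertools import product


def passes_for_step(step, total_steps, boundaries=(0.3, 0.6)):
    """Return the number of core passes to use at a given training step.

    `boundaries` are assumed fractions of training at which a pass is added.
    """
    frac = step / total_steps
    return 1 + sum(frac >= b for b in boundaries)


def warmup_variants(max_passes=3):
    """All (passes, qat_enabled) combinations to compile before training."""
    return list(product(range(1, max_passes + 1), (False, True)))


schedule = [passes_for_step(s, 1000) for s in (0, 350, 700, 999)]
variants = warmup_variants()
print(schedule)        # [1, 2, 3, 3]
print(len(variants))   # 6
```

Running one dummy forward and backward pass per variant before the timed run means that switching pass count or enabling QAT mid-training does not trigger a fresh compile.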