PR #1231
open
Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback
by nestamidavaine
val_bpb
1.1163
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.96 MB
Training Techniques
Architecture
depth recurrence
Shared transformer core reused across multiple passes with progressive growth from 1 to 3 passes during training and evaluation.
parameters: {"layers":11,"effective_layers_eval":17,"passes":[1,2,3],"core_layers":[4,5,6]}
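The layer counts above are consistent with one simple reading: layers 4 to 6 form the shared core, and repeating that core 3 times inside an 11-layer stack gives 17 layer applications. The sketch below only checks that arithmetic. `layer_schedule` is a hypothetical helper, and running the core passes back to back is an assumption, not something this summary states.

```python
def layer_schedule(num_layers, core_layers, passes):
    """Return the order in which layer indices run when the core is repeated."""
    first, last = core_layers[0], core_layers[-1]
    prefix = list(range(0, first))               # layers before the shared core
    suffix = list(range(last + 1, num_layers))   # layers after the shared core
    # Assumption: the core block is applied `passes` times in a row.
    return prefix + list(core_layers) * passes + suffix


# Effective depth for each pass count listed in the parameters.
effective_depth = {p: len(layer_schedule(11, [4, 5, 6], p)) for p in (1, 2, 3)}
```

Under this reading, `effective_depth` maps 1, 2 and 3 passes to 11, 14 and 17 layer applications, which matches `effective_layers_eval` at 3 passes.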
LeakyReLU
The MLP activation is a squared LeakyReLU with negative slope 0.5.
parameters: {"variant":"LeakyReLU(0.5)^2"}
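As a scalar illustration of the `LeakyReLU(0.5)^2` variant, the sketch below applies a leaky ReLU with slope 0.5 on the negative side and then squares the result. Reading 0.5 as the negative slope is an assumption based on the usual LeakyReLU notation, and `leaky_relu_squared` is an illustrative name.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Leaky ReLU followed by squaring, applied to a single float."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


# Positive inputs are squared directly; negative inputs are halved first.
examples = [leaky_relu_squared(v) for v in (2.0, 0.0, -2.0)]
```

With these inputs `examples` is `[4.0, 0.0, 1.0]`. The output is never negative, but unlike a squared ReLU it keeps a nonzero gradient for negative inputs.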
BigramHash
Bigram hashing component for token representation.
parameters: {"size":512}
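The summary does not say how the bigram hash is computed. The sketch below shows one generic way to hash an ordered token pair into a fixed table, assuming `size` is the number of buckets. The mixing constant, the `bos` placeholder, and both function names are hypothetical.

```python
def bigram_bucket(prev_token, token, num_buckets=512):
    """Map an ordered (previous, current) token pair to a bucket id."""
    # Hypothetical mixing step: multiply by a large odd constant, then reduce.
    return (prev_token * 1_000_003 + token) % num_buckets


def bigram_ids(tokens, num_buckets=512, bos=0):
    """Bucket id per position; the first token is paired with `bos`."""
    previous = [bos] + list(tokens[:-1])
    return [bigram_bucket(p, t, num_buckets) for p, t in zip(previous, tokens)]
```

Each bucket id would then index a small embedding table whose output is combined with the ordinary token embedding. Distinct bigrams can collide in the same bucket, which is the price of a fixed-size table.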
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary position embeddings applied to only 16 of 64 dimensions; the remaining dimensions carry no positional rotation.
parameters: {"dimensions":"16/64"}
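A minimal sketch of partial rotary embeddings on one 64-dimensional vector: only the first 16 dimensions are rotated and the other 48 pass through unchanged. The half-split pairing, the base of 10000, and the choice of the leading dimensions as the rotated slice are assumptions; the summary gives only the 16/64 ratio.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest as is."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        # One rotation frequency per pair of dimensions (i, i + half).
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair is rotated, the vector norm is preserved, and position 0 leaves the vector unchanged.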
VE128
VE128 enabled on layers 9 and 10.
parameters: {"layers":[9,10]}
ResidualScale
Per-pass learnable residual scaling to stabilize recurrent dynamics.
parameters: {"init":0.5}
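One way to picture per-pass residual scaling is `h = h + alpha_k * core(h)`, with a separate learnable `alpha_k` for each pass, initialised to 0.5. The update form and the toy core below are assumptions chosen to show the damping effect, not the PR's actual module.

```python
def recurrent_passes(h, core, scales):
    """Apply `core` once per entry in `scales`, damping each residual update."""
    for alpha in scales:
        update = core(h)
        h = [hi + alpha * ui for hi, ui in zip(h, update)]
    return h


def toy_core(h):
    """Hypothetical core whose residual update points back toward zero."""
    return [-x for x in h]


state_after_three = recurrent_passes([8.0, -4.0], toy_core, [0.5, 0.5, 0.5])
```

With this toy core every pass halves the state, so `state_after_three` is `[1.0, -0.5]`. A scale below 1 keeps any single pass from overwriting the hidden state, which is the stabilising role the entry describes.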
error feedback
Low-rank diagonal error feedback correction before each recurrent pass.
parameters: {"rank":2,"params":2560}
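The summary gives the rank and the parameter count but not the exact parameterisation. The sketch below shows one "diagonal plus low-rank" form, `correction = diag(d) * e + U (V^T e)`, which has `2 * dim * rank + dim` parameters. That equals 2560 if the model width is 512, but the width is not stated here, so both the width and the decomposition are assumptions.

```python
def error_feedback_param_count(dim, rank):
    """Parameters in U (dim x rank), V (dim x rank) and a diagonal (dim)."""
    return 2 * dim * rank + dim


def low_rank_diag_correction(err, U, V, diag):
    """Compute diag * err + U @ (V^T @ err) using plain Python lists."""
    dim, rank = len(err), len(U[0])
    # Project the error onto `rank` directions.
    z = [sum(V[i][r] * err[i] for i in range(dim)) for r in range(rank)]
    # Expand back to `dim` and add the per-dimension diagonal term.
    return [
        diag[i] * err[i] + sum(U[i][r] * z[r] for r in range(rank))
        for i in range(dim)
    ]
```

The correction would be added to the hidden state before each recurrent pass. The low-rank path costs `O(dim * rank)` per token, far less than a full `dim x dim` matrix.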
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
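The scale formula can be tabulated directly. The sketch below assumes 0-indexed layers and, purely for illustration, applies the factor to the output of an RMS-style normalisation. The norm type and the function names are assumptions.

```python
import math


def ln_scale(layer_index):
    """Depth-dependent factor 1/sqrt(layer + 1), with layers counted from 0."""
    return 1.0 / math.sqrt(layer_index + 1)


def scaled_rms_norm(x, layer_index, eps=1e-6):
    """Normalise `x` to unit RMS, then shrink it by the layer's factor."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    s = ln_scale(layer_index)
    return [s * v / rms for v in x]


scales = [round(ln_scale(i), 4) for i in range(11)]
```

For an 11-layer stack the factor falls from 1.0 at layer 0 to 0.5 at layer 3 and about 0.3015 at layer 10, so deeper layers receive progressively smaller normalised inputs.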
Jacobian proxy loss
parameters: {"lambda":0.01}
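The exact form of the proxy loss is not given in this summary. As an illustration of the general idea, the sketch below penalises a recurrent pass only when it increases the hidden-state norm, which needs two norms rather than a Jacobian. The ratio-based penalty and the name `growth_penalty` are assumptions; only `lambda = 0.01` comes from the source.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def growth_penalty(h_prev, h_next, lam=0.01, eps=1e-8):
    """Penalise norm growth across one pass; zero when the pass contracts."""
    ratio = _norm(h_next) / (_norm(h_prev) + eps)
    excess = max(0.0, ratio - 1.0)
    return lam * excess * excess
```

A pass that doubles the norm contributes roughly `0.01 * (2 - 1)^2 = 0.01` to the loss, while a pass that shrinks the state contributes nothing.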
Quantization
late QAT
bits: 6
scope: all
GPTQ-lite
bits: 6
scope: all
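Both quantization entries target 6 bits. The sketch below shows plain symmetric round-to-nearest quantization to 6-bit integer codes, the kind of "fake quantize" step a QAT forward pass can use. It does not reproduce GPTQ-lite's error compensation, and the per-group max-abs scaling is an assumption.

```python
def fake_quantize(weights, bits=6):
    """Return (integer codes, dequantised values, scale) for one weight group."""
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0.0 else 1.0
    # Round to the nearest code and clamp to the symmetric range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    dequant = [c * scale for c in codes]
    return codes, dequant, scale


codes, dequant, scale = fake_quantize([0.9, -0.3, 0.05, 0.0])
```

Here `codes` is `[31, -10, 2, 0]`, and each dequantised value differs from its original by at most half a quantization step.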
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
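The two averaging schemes can be sketched on flat weight lists: EMA blends every step with decay 0.997, while SWA keeps a running mean of snapshots taken every 50 steps. How the PR combines the two is not stated, so the sketch shows them side by side, and `ema_update`, `SWA`, and `should_snapshot` are illustrative names.

```python
def ema_update(avg, weights, decay=0.997):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]


class SWA:
    """Running arithmetic mean of weight snapshots."""

    def __init__(self):
        self.count = 0
        self.mean = None

    def add(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count for m, w in zip(self.mean, weights)]


def should_snapshot(step, swa_every=50):
    """True on the steps where a snapshot would be taken."""
    return step > 0 and step % swa_every == 0
```

With decay 0.997 the EMA weights reflect roughly the last few hundred steps, whereas SWA weights every snapshot equally.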
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"used_for_ttt":true}
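For reference, a minimal SGD-with-momentum step using the momentum of 0.9 listed here, together with the learning rate of 0.002 and gradient clip of 1 listed under Test-Time Training. The velocity convention `v = momentum * v + g` is the common one, and treating the clip value as a global-norm bound is an assumption.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient so its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


def sgd_momentum_step(params, grads, velocity, lr=0.002, momentum=0.9):
    """Return updated (params, velocity) after one momentum SGD step."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

No weight decay term appears because the entry lists `weight_decay: null`.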
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"epochs":3,"learning_rate":0.002,"gradient_clip":1,"eval_passes":3}
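The ordering that makes test-time training "score-first" can be sketched independently of the model: each chunk is scored with weights that have not yet seen it, and only then used for adaptation. The callables `score_fn` and `adapt_fn` and the scalar toy model are hypothetical stand-ins; the real run uses chunks of 32768 and 3 epochs per chunk.

```python
def score_first_ttt(chunks, score_fn, adapt_fn, epochs=3):
    """Score each chunk before adapting on it; return mean loss per token."""
    total_loss = 0.0
    total_tokens = 0
    for chunk in chunks:
        # 1) Record the loss using weights never trained on this chunk.
        total_loss += score_fn(chunk)
        total_tokens += len(chunk)
        # 2) Only after scoring, adapt on the same chunk for later chunks.
        for _ in range(epochs):
            adapt_fn(chunk)
    return total_loss / total_tokens


# Toy model: a single scalar that tries to predict every value in a chunk.
state = {"mu": 0.0}


def toy_score(chunk):
    return sum((x - state["mu"]) ** 2 for x in chunk)


def toy_adapt(chunk, lr=0.5):
    mean = sum(chunk) / len(chunk)
    state["mu"] += lr * (mean - state["mu"])
```

Since the score for a chunk is fixed before any gradient step touches that chunk, the reported loss never benefits from training on the tokens being scored.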
Evaluation
sliding window eval
parameters: {"inference_mode":true}
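Sliding-window evaluation scores each token once while giving it as much left context as the window allows. The sketch below only computes the index bookkeeping; the window and stride values are not given in this summary, so the numbers in the usage line are placeholders and `sliding_windows` is an illustrative name.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end, score_from) triples covering every token once.

    The model reads tokens [start, end) but only tokens [score_from, end)
    count toward the loss; earlier tokens in the window are context.
    """
    spans = []
    prev_end = 0
    while prev_end < num_tokens:
        # The first window scores everything it sees; later ones advance by stride.
        step = window if prev_end == 0 else stride
        end = min(prev_end + step, num_tokens)
        start = max(0, end - window)
        spans.append((start, end, prev_end))
        prev_end = end
    return spans


spans = sliding_windows(num_tokens=10, window=4, stride=2)
```

For this toy input `spans` is `[(0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8)]`. A smaller stride gives each scored token more context at the cost of more forward passes.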
Compression
lzma
level: null
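Since no compression level is listed, the sketch below uses the standard-library `lzma` module with its default preset to compress a byte payload and report its size. The payload is a stand-in for the serialized model, and reporting size in decimal megabytes is an assumption.

```python
import lzma


def compress_artifact(payload):
    """Compress bytes with LZMA; preset=None selects the library default."""
    return lzma.compress(payload, preset=None)


def size_mb(blob):
    """Size in decimal megabytes (assumed unit for the artifact figure)."""
    return len(blob) / 1_000_000


payload = bytes(range(64)) * 1000          # stand-in for quantized weights
blob = compress_artifact(payload)
restored = lzma.decompress(blob)
```

The round trip is lossless, so `restored` equals `payload`. Any accuracy loss in the artifact comes from the quantization step, not from compression.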
LR Schedule
cosine decay
parameters: {"used_for_ttt":true}
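A standard cosine decay can be written as `lr(t) = min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * t / T))`. The sketch below uses the test-time-training learning rate of 0.002 as the base; decaying to zero and the absence of warmup are assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    """Cosine-decayed learning rate at `step` out of `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the final step.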
Novel Contributions
- Progressive recurrence depth growth from 1 to 3 passes during training
- Learnable residual scaling to make recurrent passes contractive
- Low-rank error feedback module to correct accumulated recurrence error
- Jacobian proxy loss to stabilize hidden-state growth without full Jacobian computation
- Warmup precompilation of all pass/QAT graph variants to avoid compile stalls
- Legal score-first TTT protocol with sliding-window evaluation and post-score adaptation
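The first and fifth items interact: if the pass count changes during training and QAT switches on late, a compiled model can hit several distinct graphs. The sketch below enumerates those variants so each can be run once during warmup, and shows one possible growth schedule. The equal-length stages and both function names are hypothetical; the summary gives only the pass counts 1, 2 and 3.

```python
from itertools import product


def passes_at(step, total_steps, stages=(1, 2, 3)):
    """Pass count at a training step, assuming equal-length stages."""
    index = min(step * len(stages) // total_steps, len(stages) - 1)
    return stages[index]


def graph_variants(pass_counts=(1, 2, 3), qat_flags=(False, True)):
    """Every (passes, qat_enabled) combination a compiled model may encounter."""
    return list(product(pass_counts, qat_flags))


# During warmup, each variant would be run once on a dummy batch so that
# compilation happens before the timed training loop starts.
variants = graph_variants()
```

With three pass counts and QAT on or off there are 6 variants to precompile.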