PR #1231

open

Non-record: Stable Growing Recurrence, Progressive Depth + Error Feedback

by nestamidavaine

val_bpb

1.1163

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

~15.96 MB

Training Techniques

Architecture

depth recurrence

A shared transformer core (layers 4 to 6) is reused across multiple passes. The pass count grows progressively from 1 to 3 during training, and evaluation uses 3 passes, giving 17 effective layers.

parameters: {"layers":11,"effective_layers_eval":17,"passes":[1,2,3],"core_layers":[4,5,6]}
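
The effective layer count can be checked with a short sketch. It assumes the core layers run as one contiguous block that is repeated once per pass, which the listing does not spell out; `layer_schedule` is a hypothetical helper name.

```python
def layer_schedule(num_layers, core_layers, passes):
    """Return the order in which layer indices run for a given pass count.

    Assumption: the shared core is contiguous and repeated as a block.
    """
    core = set(core_layers)
    order = []
    for layer in range(num_layers):
        if layer == core_layers[0]:
            # Run the whole shared core once per pass.
            for _ in range(passes):
                order.extend(core_layers)
        elif layer not in core:
            order.append(layer)
    return order


# 8 non-core layers + 3 core layers per pass: 11, 14, 17 effective layers.
for passes in (1, 2, 3):
    print(passes, len(layer_schedule(11, [4, 5, 6], passes)))
```

Under this assumption, 3 passes give 8 + 3 * 3 = 17 layers, matching `effective_layers_eval`.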

LeakyReLU

The MLP uses a squared LeakyReLU activation.

parameters: {"variant":"LeakyReLU(0.5)^2"}
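
A scalar sketch of the activation, reading `LeakyReLU(0.5)^2` as a LeakyReLU with negative slope 0.5 followed by squaring. That reading is an assumption; a sign-preserving variant is also conceivable.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU with the given negative slope, then square the result."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


print(leaky_relu_squared(2.0))   # positive branch: 2.0 squared
print(leaky_relu_squared(-2.0))  # negative branch: (0.5 * -2.0) squared
```

Note that with plain squaring the output is non-negative on both branches; negative inputs are only damped by the slope before squaring.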

BigramHash

Hashes token bigrams into a fixed-size table to augment the token representation.

parameters: {"size":512}
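
A minimal sketch of how a bigram can be mapped to one of 512 buckets. The hash function, the multiplier, and the start-of-sequence id are all hypothetical; the listing only states the table size.

```python
def bigram_bucket(prev_token, token, table_size=512, multiplier=1000003):
    """Map a (previous token, current token) pair to a bucket index.

    The multiplicative hash is an illustrative choice, not the PR's.
    """
    return (prev_token * multiplier + token) % table_size


def bigram_ids(tokens, start_id=0, table_size=512):
    """Bucket index per position; the first token is paired with start_id."""
    prev = [start_id] + list(tokens[:-1])
    return [bigram_bucket(p, t, table_size) for p, t in zip(prev, tokens)]


print(bigram_ids([17, 905, 17, 905]))
```

Each bucket index would then select a row of a learned embedding table that is added to the ordinary token embedding.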

XSA

XSA applied to the last 4 layers.

parameters: {"layers":4}

Partial RoPE

Rotary position embeddings applied to only 16 of 64 dimensions; the remaining dimensions carry no positional rotation.

parameters: {"dimensions":"16/64"}
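
The following sketch rotates the first 16 entries of a 64-entry vector and passes the rest through. The pairing convention (entry `i` with entry `i + 8`) and the frequency base of 10000 are assumptions, not taken from the PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest as-is."""
    out = list(vec)
    half = rotary_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        # Standard 2-D rotation of the pair (a, b) by angle theta.
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out


head = [float(i) for i in range(64)]
rotated = partial_rope(head, position=7)
print(rotated[16:] == head[16:])  # the 48 unrotated entries are untouched
```

Because each pair undergoes a pure rotation, the vector norm is preserved.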

VE128

VE128 enabled on selected layers.

parameters: {"layers":[9,10]}

ResidualScale

Per-pass learnable residual scaling to stabilize recurrent dynamics.

parameters: {"init":0.5}
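
A toy sketch of per-pass residual scaling, assuming the update has the form `h + alpha_p * core(h)` with one scale per pass initialised to 0.5. The exact update rule is not given in the listing, and `recurrent_forward` is a hypothetical name.

```python
def recurrent_forward(h, core, scales):
    """Apply the shared core once per scale: h <- h + alpha * core(h)."""
    for alpha in scales:
        update = core(h)
        h = [hi + alpha * ui for hi, ui in zip(h, update)]
    return h


# Toy core that pulls the state toward zero; three passes at scale 0.5.
shrink = lambda h: [-x for x in h]
print(recurrent_forward([8.0], shrink, [0.5, 0.5, 0.5]))
```

With this toy core each pass halves the state (8 to 4 to 2 to 1), which illustrates how a scale below 1 can keep repeated passes from amplifying the hidden state.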

error feedback

Low-rank diagonal error feedback correction before each recurrent pass.

parameters: {"rank":2,"params":2560}
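
One parameterization that would reproduce the stated count of 2560 is a diagonal vector plus two rank-2 factors at a model width of 512. Both the width and this decomposition are assumptions; neither is stated in the listing.

```python
def error_feedback_params(width, rank):
    """Two width-by-rank factors plus one diagonal vector of length width."""
    return 2 * width * rank + width


def apply_correction(error, diag, u, v):
    """Return diag * e + U (V^T e) for a single vector e (lists of floats)."""
    rank = len(u[0])
    n = len(error)
    coeffs = [sum(v[i][r] * error[i] for i in range(n)) for r in range(rank)]
    return [
        diag[i] * error[i] + sum(u[i][r] * coeffs[r] for r in range(rank))
        for i in range(n)
    ]


print(error_feedback_params(512, 2))
```

The arithmetic is 2 * 512 * 2 + 512 = 2560, so the count is at least consistent with this reading.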

Regularization

layerwise LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
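
For reference, the scale per layer under the stated formula, assuming zero-based layer indices across the 11 layers:

```python
import math


def ln_scale(layer):
    """Scale for a zero-indexed layer: 1 / sqrt(layer + 1)."""
    return 1.0 / math.sqrt(layer + 1)


for layer in range(11):
    print(layer, round(ln_scale(layer), 4))
```

The first layer keeps a scale of 1, and deeper layers are damped progressively (layer 3 is scaled by 0.5).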

Jacobian proxy loss

parameters: {"lambda":0.01}
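
The listing does not define the proxy. As an illustration only, the sketch below penalises growth in the step size between successive recurrent states, which estimates whether the pass map is expanding without forming a Jacobian. The specific form is an assumption, not the PR's loss.

```python
import math


def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jacobian_proxy_loss(states, lam=0.01, eps=1e-8):
    """Penalise ratios ||h[k+1] - h[k]|| / ||h[k] - h[k-1]|| above 1.

    `states` holds the hidden state after each pass (lists of floats).
    """
    penalty = 0.0
    for prev, cur, nxt in zip(states, states[1:], states[2:]):
        ratio = _dist(nxt, cur) / (_dist(cur, prev) + eps)
        penalty += max(0.0, ratio - 1.0) ** 2
    return lam * penalty


print(jacobian_proxy_loss([[0.0], [1.0], [1.5], [1.75]]))  # shrinking steps
print(jacobian_proxy_loss([[0.0], [1.0], [3.0]]))          # growing steps
```

Shrinking steps incur no penalty, while growing steps are penalised in proportion to `lambda`.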

Quantization

late QAT

bits: 6

scope: all

GPTQ-lite

bits: 6

scope: all
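
To make the 6-bit budget concrete, here is a plain symmetric round-to-nearest quantizer for one weight row. It is neither GPTQ-lite nor QAT, just the underlying integer grid; per-row scaling is an assumption.

```python
def quantize_row(row, bits=6):
    """Symmetric round-to-nearest quantization of one row of weights."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    return [v * scale for v in q]


q, scale = quantize_row([0.31, -0.1, 0.0, 0.2])
print(q, dequantize_row(q, scale))
```

Late QAT would apply this rounding in the forward pass near the end of training, and a GPTQ-style method would choose the rounding to compensate for error across columns; both are beyond this sketch.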

Weight Averaging

EMA + SWA

parameters: {"ema_decay":0.997,"swa_every":50}
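
A sketch of the two averaging rules with the stated constants. How the PR combines the EMA and SWA weights is not described here, so the two are shown independently; the class and function names are hypothetical.

```python
def update_ema(ema, weights, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


class SwaAverager:
    """Running mean of weight snapshots taken every `every` steps."""

    def __init__(self, every=50):
        self.every = every
        self.count = 0
        self.mean = None

    def maybe_update(self, step, weights):
        if step % self.every != 0:
            return
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]


swa = SwaAverager(every=50)
for step in range(1, 201):
    swa.maybe_update(step, [float(step)])
print(swa.count, swa.mean)
```

With `swa_every` of 50, 200 steps yield four snapshots (steps 50, 100, 150, 200).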

Optimizer

SGD

weight_decay: null

momentum: 0.9

other_params: {"used_for_ttt":true}

Test-Time Training

score-first TTT

parameters: {"chunk_size":32768,"epochs":3,"learning_rate":0.002,"gradient_clip":1,"eval_passes":3}
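
The ordering that makes the protocol score-first can be sketched as a loop: each chunk is scored with the weights as they stand, and only afterwards used for adaptation. The callbacks `score_chunk` and `adapt_on_chunk` are hypothetical placeholders for the model's evaluation and SGD update.

```python
def score_first_ttt(tokens, score_chunk, adapt_on_chunk,
                    chunk_size=32768, epochs=3):
    """Score each chunk before training on it; return mean loss per token.

    `score_chunk(chunk)` is assumed to return the summed loss over the chunk.
    """
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score with weights that have not yet seen this chunk.
        total_loss += score_chunk(chunk)
        total_tokens += len(chunk)
        # 2) Only then adapt on the chunk that has already been scored.
        for _ in range(epochs):
            adapt_on_chunk(chunk)
    return total_loss / max(total_tokens, 1)


events = []
mean_loss = score_first_ttt(
    list(range(4)),
    score_chunk=lambda c: events.append("score") or float(len(c)),
    adapt_on_chunk=lambda c: events.append("adapt"),
    chunk_size=2,
)
print(events, mean_loss)
```

In the listed configuration, the adaptation step would be SGD with momentum 0.9, learning rate 0.002, and gradient clipping at 1, run for 3 epochs per chunk.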

Evaluation

sliding window eval

parameters: {"inference_mode":true}
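
A sketch of the window bookkeeping behind sliding-window evaluation: each window supplies left context, but only its newest tokens are scored, so every token is counted once. The window and stride values are hypothetical; the listing gives neither.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (context_start, score_start, end) spans covering all tokens.

    Tokens in [score_start, end) are scored; [context_start, score_start)
    is context only.
    """
    spans = []
    scored_until = 0
    while scored_until < num_tokens:
        end = min(scored_until + stride, num_tokens)
        start = max(0, end - window)
        spans.append((start, scored_until, end))
        scored_until = end
    return spans


for span in sliding_windows(num_tokens=10, window=4, stride=2):
    print(span)
```

A smaller stride gives each scored token more context at the cost of more forward passes.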

Compression

lzma

level: null
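
With no level recorded, a sketch using Python's standard `lzma` module at its default preset looks like this. The payload is a stand-in for the serialized weights, and `compress_artifact` is a hypothetical name.

```python
import lzma


def compress_artifact(payload: bytes) -> bytes:
    """Compress with lzma; no preset is passed, so the default level is used."""
    return lzma.compress(payload)


payload = bytes(range(64)) * 1024  # placeholder for serialized weights
blob = compress_artifact(payload)
print(len(payload), len(blob))
assert lzma.decompress(blob) == payload
```

The reported artifact size would be the length of the compressed blob plus whatever else the submission counts.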

LR Schedule

cosine decay

parameters: {"used_for_ttt":true}
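
A cosine decay schedule in its usual form, using the TTT learning rate of 0.002 as the starting value. The floor of zero and the step count are assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.002, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps=100))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the end.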

Novel Contributions

Progressive recurrence depth growth from 1 to 3 passes during training
Learnable residual scaling to make recurrent passes contractive
Low-rank error feedback module to correct accumulated recurrence error
Jacobian proxy loss to stabilize hidden-state growth without full Jacobian computation
Warmup precompilation of all pass/QAT graph variants to avoid compile stalls
Legal score-first TTT protocol with sliding-window evaluation and post-score adaptation
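
Two of the items above can be sketched together: a pass-count schedule that grows from 1 to 3, and the set of graph variants a warmup would need to compile up front. The growth boundaries are hypothetical fractions of training, and treating QAT as a simple on/off flag is an assumption.

```python
from itertools import product


def passes_for_step(step, total_steps, boundaries=(0.33, 0.66)):
    """Pass count at a training step; boundaries are hypothetical fractions."""
    progress = step / total_steps
    return 1 + sum(progress >= b for b in boundaries)


def graph_variants(pass_counts=(1, 2, 3), qat_states=(False, True)):
    """Every (passes, qat_enabled) combination that may occur in training."""
    return list(product(pass_counts, qat_states))


print([passes_for_step(s, 100) for s in (0, 40, 90)])
for passes, qat in graph_variants():
    # A warmup would run one forward/backward step per variant here so the
    # compiler has each graph cached before timed training begins.
    print(passes, qat)
```

Compiling all six variants ahead of time avoids a recompilation pause when the schedule switches pass count or QAT turns on.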