PR #298

open

Ultimate recurrent: 21 techniques — depth recurrence, novel ops

by MrINVISOView on GitHub
val_bpb
1.2271
Architecture
Depth-recurrent transformer
Optimizer
Muon
Artifact Size
10.7MB

Training Techniques

Architecture
depth recurrence
Three unique layers are shared across 3 passes, giving an effective depth of 9 (3 × 3).
parameters: {"unique_layers":3,"passes":3,"effective_depth":9}
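As a rough illustration of the weight sharing described above, the sketch below applies the same 3 layers on each of 3 passes and records the order in which they run. The toy layers and the names `make_layer`, `forward`, and `schedule` are hypothetical and are not taken from the PR.

```python
# Hypothetical sketch of depth recurrence; names are illustrative only.
UNIQUE_LAYERS = 3  # from the PR summary
PASSES = 3         # from the PR summary


def make_layer(index):
    """Return a toy 'layer' that appends its own index to a trace."""
    def layer(trace):
        return trace + [index]
    return layer


# Only UNIQUE_LAYERS sets of weights exist; each is reused on every pass.
layers = [make_layer(i) for i in range(UNIQUE_LAYERS)]


def forward(trace):
    """Run all unique layers once per pass, reusing the same layer objects."""
    for _ in range(PASSES):
        for layer in layers:
            trace = layer(trace)
    return trace


schedule = forward([])
effective_depth = len(schedule)
print(schedule, effective_depth)
```

The trace shows each layer index repeating once per pass, so the number of layer applications is the product of unique layers and passes while the number of distinct weight sets stays at 3.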
Transformer
Model dimension of 768, wider than the baseline.
parameters: {"dim":768}
GQA
Grouped-query attention with 8 query heads and 2 key/value heads.
parameters: {"q_heads":8,"kv_heads":2}
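With 8 query heads and 2 key/value heads, each key/value head serves a group of 4 query heads. The sketch below shows one common way to assign them, using contiguous groups; that grouping order is an assumption, since the PR summary does not state it.

```python
# Hypothetical sketch of the GQA head mapping; contiguous grouping is assumed.
Q_HEADS = 8   # from the PR summary
KV_HEADS = 2  # from the PR summary

assert Q_HEADS % KV_HEADS == 0, "query heads must divide evenly into groups"
group_size = Q_HEADS // KV_HEADS

# kv_head_for_query[q] is the key/value head that query head q attends with.
kv_head_for_query = [q // group_size for q in range(Q_HEADS)]
print(group_size, kv_head_for_query)
```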
RoPE
Rotary positional embeddings with a larger base.
parameters: {"base":500000}
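The base sets how quickly the rotation frequencies decay across channel pairs; a larger base gives slower rotations in the later pairs. The following is a minimal sketch of standard RoPE with base 500000. The head dimension of 96 (768 / 8) and the adjacent-pair layout are assumptions, and `inv_frequencies`, `rotate`, and `dot` are illustrative names.

```python
import math

ROPE_BASE = 500_000  # from the PR summary
HEAD_DIM = 96        # assumption: dim 768 split evenly over 8 query heads


def inv_frequencies(head_dim, base=ROPE_BASE):
    """One rotation frequency per channel pair: base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rotate(vec, position, base=ROPE_BASE):
    """Rotate adjacent channel pairs of vec by position-dependent angles."""
    out = []
    for i, freq in enumerate(inv_frequencies(len(vec), base)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


freqs = inv_frequencies(HEAD_DIM)
print(freqs[0], freqs[-1])
```

Because both vectors are rotated, the dot product of a rotated query and a rotated key depends only on the difference between their positions, which is the property attention relies on.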
U-Net skip connections
Skip connections across recurrent passes/layers.
parameters: null
low-rank K projection
Reduced-rank key projection to save parameters.
parameters: {"rank":32}
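To make the parameter saving concrete, the sketch below compares a full key projection with a rank-32 factorization into two smaller matrices. The key output width of 192 is an assumption (2 key/value heads times an assumed head dimension of 768 / 8 = 96); the PR summary gives only the rank.

```python
# Hypothetical parameter count; KV_DIM is derived from an assumed head size.
DIM = 768      # from the PR summary
Q_HEADS = 8    # from the PR summary
KV_HEADS = 2   # from the PR summary
K_RANK = 32    # from the PR summary

HEAD_DIM = DIM // Q_HEADS     # assumption: heads split dim evenly
KV_DIM = KV_HEADS * HEAD_DIM  # assumption: key width = kv_heads * head_dim


def full_params(d_in, d_out):
    """Weights in a single dense d_in x d_out projection (no bias)."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Weights in a d_in x rank matrix followed by a rank x d_out matrix."""
    return rank * (d_in + d_out)


full = full_params(DIM, KV_DIM)
low = low_rank_params(DIM, KV_DIM, K_RANK)
print(full, low, full / low)
```

Under these assumptions the factorized projection needs roughly a fifth of the weights of the dense one; the exact ratio changes if the real key width differs.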
low-rank TD projection
Reduced-rank temporal-difference projection to save parameters.
parameters: {"rank":16}
low-rank GRU state carry
Reduced-rank GRU state carry to save parameters.
parameters: {"rank":16}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64}
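One common way to run a sliding-window evaluation is to advance a fixed-length context window by the stride and score only the tokens that have not been scored yet, so every token after the first window is predicted with nearly full context. The sketch below follows that scheme with the 2048-token eval length and stride 64 listed here; the function name and the exact scoring rule are assumptions, not the PR's code.

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Return (start, end, score_start) spans.

    The model sees tokens [start, end) and only tokens [score_start, end)
    contribute to the loss, so each token is scored exactly once.
    """
    if n_tokens <= 0:
        return []
    spans = []
    scored_upto = 0
    end = min(window, n_tokens)
    while True:
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end >= n_tokens:
            break
        end = min(end + stride, n_tokens)
    return spans


spans = sliding_windows(2176)
for span in spans:
    print(span)
```

A smaller stride gives each scored token more context at the cost of more forward passes, since the number of windows grows roughly as the sequence length divided by the stride.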
Initialization
spectral init
Spectral embedding initialization with std = 0.1 / sqrt(dim).
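For dim = 768 the formula above gives a standard deviation of about 0.0036. The sketch below computes it and draws a small embedding table at that scale; it uses a plain Gaussian draw as a stand-in, since the summary does not describe what the "spectral" step involves beyond the std. `embedding_std` and `init_embedding` are hypothetical names.

```python
import math
import random

DIM = 768  # from the PR summary


def embedding_std(dim):
    """Standard deviation from the summary: 0.1 / sqrt(dim)."""
    return 0.1 / math.sqrt(dim)


def init_embedding(vocab_size, dim, seed=0):
    """Assumption: plain Gaussian init at the stated std, seeded for repeatability."""
    rng = random.Random(seed)
    std = embedding_std(dim)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(vocab_size)]


std = embedding_std(DIM)
table = init_embedding(vocab_size=64, dim=DIM)
print(round(std, 6), len(table), len(table[0]))
```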
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
Other
other
Value embeddings.
parameters: null
other
Per-pass control parameters for attention scale, MLP scale, and residual mixing.
parameters: null
other
Adaptive depth with an exit gate per token per pass.
parameters: null
other
Confidence conditioning across passes.
parameters: null
other
Gradient Memory Recurrence.
parameters: null
other
Thermodynamic Compression Loss (F = E - T*S).
parameters: null
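The summary gives only the form F = E - T*S. As a hedged illustration, the sketch below assumes E is the cross-entropy of the target token, S is the Shannon entropy of the predicted distribution, and T is a fixed temperature hyperparameter; none of these definitions is confirmed by the PR, and `free_energy_loss` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def free_energy_loss(logits, target, temperature=0.1):
    """F = E - T*S under the assumptions stated above.

    E: negative log-probability of the target (assumed).
    S: entropy of the predicted distribution (assumed).
    """
    probs = softmax(logits)
    energy = -math.log(probs[target])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return energy - temperature * entropy


uniform = free_energy_loss([0.0, 0.0, 0.0, 0.0], target=2, temperature=0.1)
print(uniform)
```

With T = 0 this reduces to ordinary cross-entropy; under the assumed definitions, a positive T rewards higher-entropy predictions, so the sign and schedule of T would determine what the term actually regularizes.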
other
Temporal Difference Recurrence with a rank-16 low-rank projection.
parameters: {"rank":16}
other
Eigenspace Token Routing.
parameters: null
other
Resonant Position Encoding.
parameters: null
other
Selective State GRU Carry with a rank-16 low-rank projection.
parameters: {"rank":16}
Regularization
compression-aware auxiliary loss
parameters: null

Novel Contributions

  • Depth recurrence with 3 unique layers shared across 3 passes (effective depth 9)
  • Novel recurrent mechanisms including gradient memory recurrence, temporal difference recurrence, and selective state GRU carry
  • Thermodynamic compression loss
  • Eigenspace token routing
  • Resonant position encoding
  • Adaptive depth with per-token exit gating
  • Confidence conditioning across passes
  • Low-rank projections to reduce parameter count