PR #298

open

Ultimate recurrent: 21 techniques — depth recurrence, novel ops

by MrINVISO

val_bpb

1.2271

Architecture

Depth-recurrent transformer

Optimizer

Muon

Artifact Size

10.7MB

Training Techniques

Architecture

depth recurrence

3 unique layers shared across 3 passes for effective depth 9.

parameters: {"unique_layers":3,"passes":3,"effective_depth":9}
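The layer sharing above can be sketched as a simple schedule. This is a minimal illustration, assuming the three unique layers run in order on every pass; the helper name `layer_schedule` is hypothetical and the PR's actual loop may differ.

```python
def layer_schedule(unique_layers: int, passes: int) -> list[int]:
    """Return the index of the shared layer applied at each effective depth step."""
    # Assumption: layers run in order 0..unique_layers-1 on each pass,
    # and the same weights are reused on every pass.
    return [layer for _ in range(passes) for layer in range(unique_layers)]


schedule = layer_schedule(unique_layers=3, passes=3)
effective_depth = len(schedule)  # unique_layers * passes
```

Only 3 layers' worth of weights are stored, while each token passes through 9 layer applications.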

Transformer

Model dimension widened to 768 relative to the baseline.

parameters: {"dim":768}

GQA

Grouped-query attention with 8 query heads and 2 key/value heads.

parameters: {"q_heads":8,"kv_heads":2}
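With 8 query heads and 2 key/value heads, each key/value head serves a group of 4 query heads. A minimal sketch of that mapping, assuming the usual contiguous grouping (the function name is hypothetical):

```python
def kv_head_for_query_head(q_head: int, q_heads: int = 8, kv_heads: int = 2) -> int:
    """Map a query head index to the key/value head it shares."""
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be a multiple of kv_heads")
    group_size = q_heads // kv_heads  # 4 query heads per key/value head here
    # Assumption: query heads are grouped contiguously.
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
```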

RoPE

Rotary positional embeddings with the base raised to 500000.

parameters: {"base":500000}
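The effect of the base can be seen in the per-pair rotation frequencies. The sketch below uses the standard RoPE formula; the head dimension of 96 is an assumption (768 / 8 query heads) that the summary does not state, and the function names are hypothetical.

```python
import math


def rope_inv_freq(head_dim: int, base: float = 500000.0) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/head_dim) for each channel pair i."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0: float, x1: float, pos: int, freq: float) -> tuple[float, float]:
    """Rotate one (even, odd) channel pair by the angle pos * freq."""
    angle = pos * freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


# Assumption: head_dim = 96; a larger base makes the slowest pairs rotate more slowly.
inv_freq = rope_inv_freq(head_dim=96)
```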

U-Net skip connections

Skip connections across recurrent passes/layers.

parameters: null

low-rank K projection

Reduced-rank key projection to save parameters.

parameters: {"rank":32}
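To give a rough sense of the saving, the sketch below compares a full key projection with a rank-32 factorization. It assumes the key projection maps 768 to 192 dimensions (2 key/value heads of 96 each) and ignores biases; neither detail is stated in the summary, and the helper names are hypothetical.

```python
def full_rank_params(d_in: int, d_out: int) -> int:
    """Weights in a dense d_in x d_out projection (no bias)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in a factorized projection: (d_in x rank) followed by (rank x d_out)."""
    return rank * (d_in + d_out)


dim = 768
kv_dim = 2 * (dim // 8)  # assumption: 2 KV heads, head_dim = dim / 8 query heads

full_k = full_rank_params(dim, kv_dim)
low_k = low_rank_params(dim, kv_dim, rank=32)
```

Under these assumptions the factorized projection uses 30720 weights instead of 147456, about 4.8 times fewer.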

low-rank TD projection

Reduced-rank temporal-difference projection to save parameters.

parameters: {"rank":16}

low-rank GRU state carry

Reduced-rank GRU state carry to save parameters.

parameters: {"rank":16}

Sequence Length

sequence_length

train_length: 2048

eval_length: 2048

Evaluation

sliding window eval

parameters: {"stride":64}
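A minimal sketch of how a stride-64 sliding window evaluation might lay out its windows, assuming a window of 2048 tokens (the eval length above) and that each window scores only the tokens not already scored by an earlier window. The function name and the scoring rule are assumptions, not taken from the PR.

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Return (start, end, score_from) spans; tokens in [score_from, end) are scored."""
    spans = []
    scored_upto = 0  # every token before this index has already been scored
    start = 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


spans = sliding_windows(n_tokens=2200)
```

After the first window, each scored token sees at least `window - stride` (1984) tokens of preceding context, at the cost of one forward pass per 64 new tokens.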

Initialization

spectral init

Spectral embedding initialization with std = 0.1 / sqrt(dim).
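For dim = 768 the stated formula gives a standard deviation of roughly 0.0036. The sketch below computes it and draws a small embedding table; the zero-mean Gaussian draw is an assumption, since the summary gives only the std, and the function names are hypothetical.

```python
import math
import random


def embedding_init_std(dim: int) -> float:
    """Standard deviation from the stated rule: 0.1 / sqrt(dim)."""
    return 0.1 / math.sqrt(dim)


def init_embedding(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Draw a vocab_size x dim table; the Gaussian distribution is an assumption."""
    rng = random.Random(seed)
    std = embedding_init_std(dim)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(vocab_size)]


std = embedding_init_std(768)
table = init_embedding(vocab_size=4, dim=768)
```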

Optimizer

Muon

weight_decay: 0.01

momentum: null

other_params: null

Other

other

Value embeddings.

parameters: null

other

Per-pass control parameters for attention scale, MLP scale, and residual mixing.

parameters: null

other

Adaptive depth with an exit gate per token per pass.

parameters: null
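The summary gives no details on the exit gate, so the following is only a hypothetical sketch of the general idea: each token is updated pass by pass until its gate fires, after which its state is frozen. The hard threshold, `step_fn`, and `gate_fn` are all assumptions; a trainable version would more likely use a soft, differentiable gate.

```python
def run_with_exit_gates(states, step_fn, gate_fn, passes=3, threshold=0.5):
    """Apply step_fn per token per pass, freezing a token once its gate exceeds threshold."""
    states = list(states)  # do not mutate the caller's list
    exited = [False] * len(states)
    exit_pass = [passes] * len(states)  # number of passes each token actually used
    for p in range(passes):
        for i in range(len(states)):
            if exited[i]:
                continue  # token already exited; skip further passes
            states[i] = step_fn(states[i], p)
            if gate_fn(states[i], p) > threshold:
                exited[i] = True
                exit_pass[i] = p + 1
    return states, exit_pass


# Toy usage: the "layer" adds 1, and the gate fires once the state reaches 2.
final_states, used_passes = run_with_exit_gates(
    [0, 1, 5],
    step_fn=lambda s, p: s + 1,
    gate_fn=lambda s, p: 1.0 if s >= 2 else 0.0,
)
```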

other

Confidence conditioning across passes.

parameters: null

other

Gradient Memory Recurrence.

parameters: null

other

Thermodynamic Compression Loss (F = E - T*S).

parameters: null
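The summary does not define E, T, or S. One plausible reading, shown here purely as an assumption, takes E as the cross-entropy on the target token, S as the Shannon entropy of the predicted distribution, and T as a scalar temperature. All names below are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def free_energy_loss(logits: list[float], target: int, temperature: float) -> float:
    """F = E - T*S, with E = cross-entropy and S = predictive entropy (both assumed)."""
    probs = softmax(logits)
    energy = -math.log(probs[target])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return energy - temperature * entropy


# Uniform prediction over 4 classes: E = ln 4, S = ln 4, so F = (1 - T) * ln 4.
loss = free_energy_loss([0.0, 0.0, 0.0, 0.0], target=0, temperature=0.5)
```

Under this reading, T = 0 recovers plain cross-entropy, and a positive T rewards higher-entropy predictions.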

other

Temporal Difference Recurrence with a rank-16 low-rank projection.

parameters: {"rank":16}

other

Eigenspace Token Routing.

parameters: null

other

Resonant Position Encoding.

parameters: null

other

Selective State GRU Carry with a rank-16 low-rank projection.

parameters: {"rank":16}

Regularization

compression-aware auxiliary loss

parameters: null

Novel Contributions

Depth recurrence with 3 unique layers shared across 3 passes (effective depth 9)
Novel recurrent mechanisms including gradient memory recurrence, temporal difference recurrence, and selective state GRU carry
Thermodynamic compression loss
Eigenspace token routing
Resonant position encoding
Adaptive depth with per-token exit gating
Confidence conditioning across passes
Low-rank projections to reduce parameter count