val_bpb
1.2271
Architecture
Depth-recurrent transformer
Optimizer
Muon
Artifact Size
10.7MB
Training Techniques
Architecture
depth recurrence
3 unique layers reused across 3 passes, giving an effective depth of 9 (3 × 3).
parameters: {"unique_layers":3,"passes":3,"effective_depth":9}
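The depth-recurrence entry above can be illustrated with a minimal sketch. It assumes the passes cycle through the shared layers in order (0, 1, 2, 0, 1, 2, ...); the card does not state the ordering, and `layer_schedule` and `run_recurrent` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence


def layer_schedule(unique_layers: int = 3, passes: int = 3) -> List[int]:
    """Index of the shared layer applied at each effective depth step.

    Assumption: each pass runs the unique layers in order.
    """
    return [layer for _ in range(passes) for layer in range(unique_layers)]


def run_recurrent(x: float, layers: Sequence[Callable[[float], float]], passes: int = 3) -> float:
    """Apply the same list of layers `passes` times, reusing their weights."""
    for _ in range(passes):
        for layer in layers:
            x = layer(x)
    return x


schedule = layer_schedule()          # 9 steps built from only 3 distinct layers
toy_layers = [lambda v: v + 1.0] * 3  # stand-ins for transformer blocks
result = run_recurrent(0.0, toy_layers)
```

The parameter count scales with the 3 unique layers, while compute scales with the 9 applied steps.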
Transformer
Model dimension widened to 768 relative to the baseline.
parameters: {"dim":768}
GQA
Grouped-query attention with 8 query heads and 2 key/value heads.
parameters: {"q_heads":8,"kv_heads":2}
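With 8 query heads and 2 key/value heads, each key/value head serves a group of 4 query heads. The sketch below shows that mapping, assuming the common contiguous grouping (query heads 0-3 share key/value head 0, and so on); the card does not say how heads are grouped, and the function name is hypothetical.

```python
def kv_head_for_query_head(q_head: int, q_heads: int = 8, kv_heads: int = 2) -> int:
    """Return the key/value head that a given query head attends with.

    Assumption: query heads are grouped contiguously.
    """
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be a multiple of kv_heads")
    group_size = q_heads // kv_heads  # 4 query heads per key/value head here
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
```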
RoPE
Rotary positional embeddings with a larger base.
parameters: {"base":500000}
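To show where the base enters, here is a minimal sketch of the standard rotary frequency schedule with base 500000. The head size of 96 is an assumption (768 / 8 query heads); it is not stated in the card, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def rope_inv_freq(head_dim: int, base: float = 500000.0) -> List[float]:
    """One rotation frequency per channel pair: base^(-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0: float, x1: float, position: int, inv_freq: float) -> Tuple[float, float]:
    """Rotate one channel pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


freqs = rope_inv_freq(96)  # 96 is an assumed head size, not from the card
y0, y1 = rotate_pair(3.0, 4.0, position=17, inv_freq=freqs[5])
```

A larger base lowers the slowest frequencies, so the lowest-frequency channel pairs rotate more slowly with position.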
U-Net skip connections
Skip connections across recurrent passes/layers.
parameters: null
low-rank K projection
Reduced-rank key projection to save parameters.
parameters: {"rank":32}
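The saving from a rank-32 key projection can be estimated with a short sketch. It assumes the key projection maps 768 inputs to 192 outputs (2 key/value heads times an assumed head size of 96) and that biases are absent; neither is stated in the card, and the helper names are hypothetical.

```python
from typing import List


def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a full d_in -> d_out projection (no bias)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in a factored projection d_in -> rank -> d_out (no bias)."""
    return rank * (d_in + d_out)


def low_rank_apply(x: List[float], down: List[List[float]], up: List[List[float]]) -> List[float]:
    """Compute up @ (down @ x) with down: rank x d_in and up: d_out x rank."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in down]
    return [sum(w * zi for w, zi in zip(row, z)) for row in up]


full = dense_params(768, 192)             # 147456 under the assumed shapes
factored = low_rank_params(768, 192, 32)  # 30720 under the assumed shapes
toy = low_rank_apply([1.0, 2.0], [[1.0, 1.0]], [[2.0], [3.0]])
```

The same counting applies to the rank-16 projections listed below, given their input and output widths.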
low-rank TD projection
Reduced-rank temporal-difference (TD) projection to save parameters.
parameters: {"rank":16}
low-rank GRU state carry
Reduced-rank GRU state carry to save parameters.
parameters: {"rank":16}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64}
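A sliding-window evaluation with stride 64 can be sketched as follows. This assumes the common scheme in which the first window scores all of its tokens and each later window scores only its final 64 tokens with up to 2048 tokens of context; the card gives only the stride. It also assumes val_bpb means bits per byte. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64) -> List[Tuple[int, int, int]]:
    """Return (ctx_start, end, score_start) spans.

    The model sees tokens [ctx_start, end) and only tokens
    [score_start, end) contribute to the loss, so every token is
    scored exactly once.
    """
    spans = []
    score_start = 0
    while score_start < n_tokens:
        if score_start == 0:
            end = min(window, n_tokens)              # first window scores everything
        else:
            end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, end, score_start))
        score_start = end
    return spans


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


spans = sliding_windows(5000)
```

A small stride gives almost every scored token close to the full context, at the cost of many more forward passes.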
Initialization
spectral init
Spectral embedding initialization with std = 0.1 / sqrt(dim).
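For dim = 768 the stated formula gives a standard deviation of roughly 0.0036. The sketch below computes it and draws a toy embedding table; it assumes zero-mean Gaussian sampling and does not model whatever "spectral" structure the method applies, since the card does not describe it. The function names are hypothetical.

```python
import math
import random
from typing import List


def embedding_init_std(dim: int, scale: float = 0.1) -> float:
    """std = scale / sqrt(dim), as stated in the card."""
    return scale / math.sqrt(dim)


def init_embedding(vocab_size: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Toy table sampled from N(0, std^2). Gaussian sampling is an assumption."""
    rng = random.Random(seed)
    std = embedding_init_std(dim)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(vocab_size)]


std_768 = embedding_init_std(768)
table = init_embedding(vocab_size=4, dim=768)
```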
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
Other
other
Value embeddings.
parameters: null
other
Per-pass control parameters for attention scale, MLP scale, and residual mixing.
parameters: null
other
Adaptive depth with an exit gate per token per pass.
parameters: null
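One way to picture per-token, per-pass exit gating is the sketch below. It assumes a hard threshold on a gate value, after which a token's state is frozen for the remaining passes; the card does not say whether the gate is hard or soft, and `step_fn`, `gate_fn`, and `threshold` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def adaptive_depth(
    states: Sequence[float],
    step_fn: Callable[[float, int], float],
    gate_fn: Callable[[float, int], float],
    passes: int = 3,
    threshold: float = 0.5,
) -> Tuple[List[float], List[Optional[int]]]:
    """Update each token per pass until its exit gate fires.

    Returns the final states and, per token, the pass index at which it
    exited (None if it used every pass).
    """
    out = list(states)
    exited_at: List[Optional[int]] = [None] * len(out)
    for p in range(passes):
        for i, h in enumerate(out):
            if exited_at[i] is not None:
                continue  # token already exited; keep its state frozen
            h = step_fn(h, p)
            out[i] = h
            if gate_fn(h, p) > threshold:
                exited_at[i] = p
    return out, exited_at


final, exits = adaptive_depth(
    [0.0, 10.0],
    step_fn=lambda h, p: h + 1.0,
    gate_fn=lambda h, p: 1.0 if h >= 10.0 else 0.0,
)
```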
other
Confidence conditioning across passes.
parameters: null
other
Gradient Memory Recurrence.
parameters: null
other
Thermodynamic Compression Loss (F = E - T*S).
parameters: null
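The card gives only the formula F = E - T*S. The sketch below shows one possible reading, assuming E is the per-token cross-entropy, S is the entropy of the predicted distribution, and T is a fixed temperature; these interpretations and the default T = 0.1 are assumptions, not details from the source.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def free_energy_loss(logits: List[float], target: int, temperature: float = 0.1) -> float:
    """F = E - T * S under the assumed definitions of E and S."""
    probs = softmax(logits)
    energy = -math.log(probs[target])                           # E: cross-entropy
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)  # S: predictive entropy
    return energy - temperature * entropy


uniform_loss = free_energy_loss([0.0, 0.0, 0.0, 0.0], target=2, temperature=0.5)
```

Under this reading, a positive T rewards higher-entropy predictions, so it acts like an entropy bonus on top of the cross-entropy term.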
other
Temporal Difference Recurrence with a rank-16 low-rank projection.
parameters: {"rank":16}
other
Eigenspace Token Routing.
parameters: null
other
Resonant Position Encoding.
parameters: null
other
Selective State GRU Carry with a rank-16 low-rank projection.
parameters: {"rank":16}
Regularization
compression-aware auxiliary loss
parameters: null
Novel Contributions
- Depth recurrence with 3 unique layers shared across 3 passes (effective depth 9)
- Recurrent mechanisms: gradient memory recurrence, temporal difference recurrence, and selective state GRU carry
- Thermodynamic compression loss
- Eigenspace token routing
- Resonant position encoding
- Adaptive depth with per-token exit gating
- Confidence conditioning across passes
- Low-rank projections (rank 32 for K, rank 16 for the TD and GRU state-carry projections) to reduce parameter count