PR #1299

open

Non-record: 1.8184 BPB Single-step Recurrent Transformer with Q-LoRA (Windows 3090)

by Ribin545
val_bpb
1.8184
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
depth recurrence
Evolved from a Universal Transformer with tied recurrence into a single-step recurrent transformer regime.
parameters: {"steps":1}
weight tying
Universal Transformer-style tied recurrence / reused block structure.
parameters: null
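The tied-recurrence structure can be sketched as below; `block_fn`, the weight values, and the dimensions are illustrative toys, not the PR's actual module:

```python
import numpy as np

def tied_recurrent_forward(x, block_fn, steps=1):
    # Universal-Transformer-style depth recurrence: the SAME block
    # (same parameters) is applied `steps` times. With steps=1, as in
    # this PR, the loop degenerates to a single ordinary forward pass.
    for _ in range(steps):
        x = block_fn(x)
    return x

W = 0.1 * np.eye(4)              # toy tied weights
block = lambda h: h + h @ W      # toy residual block
x = np.ones((2, 4))
one_step = tied_recurrent_forward(x, block, steps=1)  # -> 1.1 everywhere
```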
LeakyReLU
Uses a LeakyReLU-based MLP activation path, described in the README as X × W + LeakyReLU^2.
parameters: null
coordinate embeddings
Adds step/coordinate embeddings to the recursive block.
parameters: null
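A minimal sketch of the step/coordinate embedding, assuming a learned `(max_steps, d_model)` table; names and shapes are illustrative:

```python
import numpy as np

def add_coordinate_embedding(x, step, coord_emb):
    # Adds a per-step ("coordinate") embedding so a weight-tied
    # recursive block can condition on which recurrence step it is in.
    # coord_emb: (max_steps, d_model) learned table (illustrative).
    return x + coord_emb[step]

coord_emb = np.arange(12, dtype=np.float64).reshape(3, 4)  # toy table
x = np.zeros((2, 4))
out = add_coordinate_embedding(x, 1, coord_emb)  # row 1 broadcast over batch
```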
MLP activation
Fused Triton MLP using X × W + LeakyReLU^2.
parameters: null
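One plain-NumPy reading of "X × W + LeakyReLU^2" is a squared LeakyReLU applied after the input projection; the PR fuses this into a Triton kernel, and the slope and dimensions here are assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0.0, x, slope * x)

def mlp_leakyrelu_squared(x, W_in, W_out):
    # Squared-LeakyReLU MLP: h = LeakyReLU(x @ W_in), then h**2, then
    # the output projection. The PR implements this as one fused
    # Triton kernel; this unfused reference is for clarity only.
    h = leaky_relu(x @ W_in)
    return (h * h) @ W_out

x = np.array([[1.0, -1.0]])
out = mlp_leakyrelu_squared(x, np.eye(2), np.eye(2))  # -> [[1.0, 0.0001]]
```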
Quantization
Q-LoRA
bits: null
scope: q
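A toy sketch of Q-LoRA restricted to the q projection: the frozen base weight stays quantized (here a per-tensor int8 scheme; the PR's bit width is unstated) while a low-rank adapter trains in full precision. All names and shapes are illustrative:

```python
import numpy as np

def qlora_q_proj(x, W_q_int8, scale, A, B):
    # Frozen, quantized base q-projection plus a trainable low-rank
    # LoRA delta (x @ A @ B). Only q is adapted (scope: q).
    W_deq = W_q_int8.astype(np.float64) * scale  # dequantize on the fly
    return x @ W_deq + (x @ A) @ B

rng = np.random.default_rng(0)
d, r = 4, 2
W_q_int8 = rng.integers(-127, 128, size=(d, d), dtype=np.int8)
scale = 1.0 / 127.0
A = np.zeros((d, r))          # one LoRA factor starts at zero,
B = rng.normal(size=(r, d))   # so the adapter is a no-op at step 0
x = np.ones((1, d))
out = qlora_q_proj(x, W_q_int8, scale, A, B)
```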
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"lr":0.009,"backend_steps":5}
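At Muon's core is a Newton-Schulz iteration that approximately orthogonalizes each weight update; `backend_steps: 5` is assumed here to be that iteration count. Coefficients follow the reference Muon implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration: pushes the singular values of
    # the (momentum-averaged) update matrix toward 1, i.e. approximate
    # orthogonalization. Coefficients are from the reference Muon
    # implementation; steps=5 assumes backend_steps is this count.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

out = newton_schulz_orthogonalize(np.eye(4), steps=5)
```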
LR Schedule
warmup
parameters: null
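A linear-warmup sketch consistent with the listed Muon lr of 0.009; the warmup length is not stated in the PR, so `warmup_steps=100` is a placeholder:

```python
def lr_at(step, base_lr=0.009, warmup_steps=100):
    # Linear LR warmup: ramp from 0 to base_lr over warmup_steps,
    # then hold. warmup_steps=100 is a placeholder; the PR does not
    # state its actual warmup length.
    return base_lr * min(1.0, step / warmup_steps)
```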
Regularization
gradient clipping
parameters: null
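Gradient clipping here presumably means the usual global-norm clip; the threshold is not given, so `max_norm=1.0` is a placeholder:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # Global-norm gradient clipping: if the combined L2 norm of all
    # gradients exceeds max_norm, scale every gradient down uniformly.
    # max_norm=1.0 is a placeholder; the PR does not state its value.
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / max(total, 1e-12))
    return [g * scale for g in grads]

clipped = clip_grad_norm([np.array([3.0, 4.0])], max_norm=1.0)  # norm 5 -> 1
```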
logit softcap
parameters: null
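Logit softcapping is typically the tanh form below; the cap value is not stated in the PR, so `cap=15.0` is only illustrative:

```python
import numpy as np

def softcap(logits, cap=15.0):
    # Soft-caps logits into (-cap, cap) via tanh: near-identity for
    # small logits, smooth saturation for large ones. cap=15.0 is an
    # illustrative value; the PR does not state its cap.
    return cap * np.tanh(logits / cap)
```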
Other
other
Strict pre-normalization with RMSNorm removed from the residual path to support deep state accumulation.
parameters: null
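The strict pre-norm arrangement described above, in sketch form: RMSNorm is applied only to the branch input, never to the residual stream, so accumulated state passes through unscaled (the branch function is illustrative):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def strict_prenorm_step(x, branch_fn):
    # Strict pre-normalization: normalize ONLY the branch input.
    # The residual stream x itself is never renormalized, so deep /
    # recurrent state can accumulate across steps without rescaling.
    return x + branch_fn(rms_norm(x))

# Contrast: a post-norm step would be rms_norm(x + branch_fn(x)),
# which rescales the accumulated state at every step.
out = strict_prenorm_step(np.full((1, 4), 3.0), lambda h: h)  # -> ~4.0 each
```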
other
Universal Gradient Averaging with a 1/12 scaling factor and a 20-step maturity ramp to stabilize recursion.
parameters: {"gradient_averaging":0.08333333333333333,"maturity_ramp_steps":20}
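"Universal Gradient Averaging" is not a standard term; one hedged reading consistent with the listed parameters is a 1/12 gradient scale that ramps in linearly over the first 20 optimizer steps:

```python
def uga_scale(step, base=1.0 / 12.0, maturity_ramp_steps=20):
    # Hypothetical reading of Universal Gradient Averaging: the tied
    # block's gradient is scaled by 1/12 (= 0.0833...), with that
    # scale ramped in linearly over the first 20 steps ("maturity
    # ramp") to stabilize the recursion early in training.
    ramp = min(1.0, step / maturity_ramp_steps)
    return base * ramp
```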
other
Fused Triton MLP kernel for improved Windows throughput.
parameters: null
Sequence Length
sequence_length
train_length: 524288
eval_length: null

Novel Contributions

  • Single-step recurrent transformer direction replacing depth-heavy UT behavior
  • Q-LoRA on q projections
  • Strict pre-normalization with residual-path RMSNorm removal
  • Universal Gradient Averaging with maturity ramp
  • Fused Triton MLP kernel for Windows RTX 3090 throughput
  • Deterministic data path for reproducible runs