PR #1299
open
Non-record: 1.8184 BPB Single-step Recurrent Transformer with Q-LoRA (Windows 3090)
by Ribin545
val_bpb
1.8184
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
depth recurrence
Evolved from a Universal Transformer with tied recurrence into a single-step recurrent transformer.
parameters: {"steps":1}
weight tying
Universal Transformer-style tied recurrence / reused block structure.
parameters: null
LeakyReLU
Uses a LeakyReLU-based MLP activation path, described as X × W + LeakyReLU^2 in the README.
parameters: null
coordinate embeddings
Adds step/coordinate embeddings to the recursive block.
parameters: null
MLP activation
Fused Triton MLP using X × W + LeakyReLU^2.
parameters: null
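One plausible reading of the README's "X × W + LeakyReLU^2" is a matmul followed by a squared LeakyReLU. A reference sketch in numpy; the actual fused Triton kernel is not reproduced here:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0.0, x, slope * x)

def mlp_activation(x, W, slope=0.01):
    """Matmul followed by a squared LeakyReLU -- one interpretation of
    the README's "X × W + LeakyReLU^2"; the slope is an assumption."""
    h = x @ W
    a = leaky_relu(h, slope)
    return a * a  # squaring keeps the output non-negative
```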
Quantization
Q-LoRA
bits: null
scope: q
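Q-LoRA scoped to the query projection means the frozen base q weight is stored quantized and only a low-rank adapter trains. A sketch with an assumed int8 scheme, rank, and alpha (the PR lists scope "q" but no bit width or rank):

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor int8 quantization of a frozen base weight
    (the PR's actual bit width is not listed)."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def qlora_q_proj(x, W_int8, scale, A, B, alpha=16.0):
    """Query projection = dequantized frozen base + trainable low-rank
    (LoRA) update. Only A and B would train; rank/alpha illustrative."""
    base = x @ (W_int8.astype(np.float32) * scale)
    return base + (x @ A @ B) * (alpha / A.shape[1])
```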
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"lr":0.009,"backend_steps":5}
LR Schedule
warmup
parameters: null
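Since no schedule parameters are listed, here is only the generic shape of a linear warmup, with the Muon lr from above and an assumed warmup length:

```python
def warmup_lr(step, base_lr=0.009, warmup_steps=100):
    """Linear LR warmup; base_lr matches the Muon lr listed above, while
    warmup_steps is illustrative (the PR lists no schedule parameters)."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```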
Regularization
gradient clipping
parameters: null
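Gradient clipping here presumably means the standard global-norm form; a sketch with an assumed threshold, since the PR lists no clipping parameters:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Global-norm gradient clipping: rescale all gradients together if
    their joint L2 norm exceeds max_norm (max_norm is illustrative)."""
    total = float(np.sqrt(sum((g * g).sum() for g in grads)))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```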
logit softcap
parameters: null
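Logit softcapping is conventionally the tanh-based bound below; the cap value is an assumption, since the PR lists no parameters:

```python
import numpy as np

def softcap_logits(logits, cap=30.0):
    """tanh-based logit softcapping: smoothly bounds logits to
    (-cap, cap) while staying near-identity for small values.
    cap=30 is a common choice, not taken from the PR."""
    return cap * np.tanh(logits / cap)
```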
Other
other
Strict pre-normalization with RMSNorm removed from the residual path to support deep state accumulation.
parameters: null
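The "RMSNorm removed from the residual path" idea can be sketched as: normalize only the sublayer input, never the residual stream itself. A minimal illustration of the structure, not the PR's code:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def strict_prenorm_block(x, sublayer):
    """Strict pre-normalization: RMSNorm applies only to the sublayer
    input; the residual stream is left unnormalized so state can
    accumulate across the recurrence."""
    return x + sublayer(rms_norm(x))
```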
other
Universal Gradient Averaging with a 1/12 scaling factor and a 20-step maturity ramp to stabilize recursion.
parameters: {"gradient_averaging":0.08333333333333333,"maturity_ramp_steps":20}
other
Fused Triton MLP kernel for improved Windows throughput.
parameters: null
Sequence Length
sequence_length
train_length: 524288
eval_length: null
Novel Contributions
- Single-step recurrent transformer replacing the depth-heavy Universal Transformer recurrence
- Q-LoRA on q projections
- Strict pre-normalization with residual-path RMSNorm removal
- Universal Gradient Averaging with maturity ramp
- Fused Triton MLP kernel for Windows RTX 3090 throughput
- Deterministic data path for reproducible runs