PR #1299

open

Non-record: 1.8184 BPB Single-step Recurrent Transformer with Q-LoRA (Windows 3090)

by Ribin545
val_bpb
1.8184
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
depth recurrence
Evolved from a Universal Transformer with tied recurrence into a single-step recurrent transformer regime.
parameters: {"steps":1}
weight tying
Universal Transformer-style tied recurrence / reused block structure.
parameters: null
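The tied-recurrence structure can be sketched as below; `block_fn`, the weight values, and the dimensions are illustrative toys, not the PR's actual module:

```python
import numpy as np

def tied_recurrent_forward(x, block_fn, steps=1):
    # Universal-Transformer-style depth recurrence: the SAME block
    # (same parameters) is applied `steps` times. With steps=1, as in
    # this PR, the loop degenerates to a single ordinary forward pass.
    for _ in range(steps):
        x = block_fn(x)
    return x

W = 0.1 * np.eye(4)              # toy tied weights
block = lambda h: h + h @ W      # toy residual block
x = np.ones((2, 4))
one_step = tied_recurrent_forward(x, block, steps=1)  # -> 1.1 everywhere
```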
LeakyReLU
Uses a LeakyReLU-based MLP activation path, described in the README as X × W + LeakyReLU^2.
parameters: null
coordinate embeddings
Adds step/coordinate embeddings to the recursive block.
parameters: null
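A minimal sketch of the step/coordinate embedding, assuming a learned `(max_steps, d_model)` table; names and shapes are illustrative:

```python
import numpy as np

def add_coordinate_embedding(x, step, coord_emb):
    # Adds a per-step ("coordinate") embedding so a weight-tied
    # recursive block can condition on which recurrence step it is in.
    # coord_emb: (max_steps, d_model) learned table (illustrative).
    return x + coord_emb[step]

coord_emb = np.arange(12, dtype=np.float64).reshape(3, 4)  # toy table
x = np.zeros((2, 4))
out = add_coordinate_embedding(x, 1, coord_emb)  # row 1 broadcast over batch
```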
MLP activation
Fused Triton MLP using X × W + LeakyReLU^2.
parameters: null
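One plain-NumPy reading of "X × W + LeakyReLU^2" is a squared LeakyReLU applied after the input projection; the PR fuses this into a Triton kernel, and the slope and dimensions here are assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0.0, x, slope * x)

def mlp_leakyrelu_squared(x, W_in, W_out):
    # Squared-LeakyReLU MLP: h = LeakyReLU(x @ W_in), then h**2, then
    # the output projection. The PR implements this as one fused
    # Triton kernel; this unfused reference is for clarity only.
    h = leaky_relu(x @ W_in)
    return (h * h) @ W_out

x = np.array([[1.0, -1.0]])
out = mlp_leakyrelu_squared(x, np.eye(2), np.eye(2))  # -> [[1.0, 0.0001]]
```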
Quantization
Q-LoRA
bits: null
scope: q
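A toy sketch of Q-LoRA restricted to the q projection: the frozen base weight stays quantized (here a per-tensor int8 scheme; the PR's bit width is unstated) while a low-rank adapter trains in full precision. All names and shapes are illustrative:

```python
import numpy as np

def qlora_q_proj(x, W_q_int8, scale, A, B):
    # Frozen, quantized base q-projection plus a trainable low-rank
    # LoRA delta (x @ A @ B). Only q is adapted (scope: q).
    W_deq = W_q_int8.astype(np.float64) * scale  # dequantize on the fly
    return x @ W_deq + (x @ A) @ B

rng = np.random.default_rng(0)
d, r = 4, 2
W_q_int8 = rng.integers(-127, 128, size=(d, d), dtype=np.int8)
scale = 1.0 / 127.0
A = np.zeros((d, r))          # one LoRA factor starts at zero,
B = rng.normal(size=(r, d))   # so the adapter is a no-op at step 0
x = np.ones((1, d))
out = qlora_q_proj(x, W_q_int8, scale, A, B)
```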
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"lr":0.009,"backend_steps":5}
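At Muon's core is a Newton-Schulz iteration that approximately orthogonalizes each weight update; `backend_steps: 5` is assumed here to be that iteration count. Coefficients follow the reference Muon implementation:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration: pushes the singular values of
    # the (momentum-averaged) update matrix toward 1, i.e. approximate
    # orthogonalization. Coefficients are from the reference Muon
    # implementation; steps=5 assumes backend_steps is this count.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

out = newton_schulz_orthogonalize(np.eye(4), steps=5)
```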
LR Schedule
warmup
parameters: null
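A linear-warmup sketch consistent with the listed Muon lr of 0.009; the warmup length is not stated in the PR, so `warmup_steps=100` is a placeholder:

```python
def lr_at(step, base_lr=0.009, warmup_steps=100):
    # Linear LR warmup: ramp from 0 to base_lr over warmup_steps,
    # then hold. warmup_steps=100 is a placeholder; the PR does not
    # state its actual warmup length.
    return base_lr * min(1.0, step / warmup_steps)
```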
Regularization
gradient clipping
parameters: null
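Gradient clipping here presumably means the usual global-norm clip; the threshold is not given, so `max_norm=1.0` is a placeholder:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # Global-norm gradient clipping: if the combined L2 norm of all
    # gradients exceeds max_norm, scale every gradient down uniformly.
    # max_norm=1.0 is a placeholder; the PR does not state its value.
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / max(total, 1e-12))
    return [g * scale for g in grads]

clipped = clip_grad_norm([np.array([3.0, 4.0])], max_norm=1.0)  # norm 5 -> 1
```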
logit softcap
parameters: null
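Logit softcapping is typically the tanh form below; the cap value is not stated in the PR, so `cap=15.0` is only illustrative:

```python
import numpy as np

def softcap(logits, cap=15.0):
    # Soft-caps logits into (-cap, cap) via tanh: near-identity for
    # small logits, smooth saturation for large ones. cap=15.0 is an
    # illustrative value; the PR does not state its cap.
    return cap * np.tanh(logits / cap)
```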
Other
other
Strict pre-normalization with RMSNorm removed from the residual path to support deep state accumulation.
parameters: null
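The strict pre-norm arrangement described above, in sketch form: RMSNorm is applied only to the branch input, never to the residual stream, so accumulated state passes through unscaled (the branch function is illustrative):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def strict_prenorm_step(x, branch_fn):
    # Strict pre-normalization: normalize ONLY the branch input.
    # The residual stream x itself is never renormalized, so deep /
    # recurrent state can accumulate across steps without rescaling.
    return x + branch_fn(rms_norm(x))

# Contrast: a post-norm step would be rms_norm(x + branch_fn(x)),
# which rescales the accumulated state at every step.
out = strict_prenorm_step(np.full((1, 4), 3.0), lambda h: h)  # -> ~4.0 each
```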
other
Universal Gradient Averaging with a 1/12 scaling factor and a 20-step maturity ramp to stabilize recursion.
parameters: {"gradient_averaging":0.08333333333333333,"maturity_ramp_steps":20}
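"Universal Gradient Averaging" is not a standard term; one hedged reading consistent with the listed parameters is a 1/12 gradient scale that ramps in linearly over the first 20 optimizer steps:

```python
def uga_scale(step, base=1.0 / 12.0, maturity_ramp_steps=20):
    # Hypothetical reading of Universal Gradient Averaging: the tied
    # block's gradient is scaled by 1/12 (= 0.0833...), with that
    # scale ramped in linearly over the first 20 steps ("maturity
    # ramp") to stabilize the recursion early in training.
    ramp = min(1.0, step / maturity_ramp_steps)
    return base * ramp
```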
other
Fused Triton MLP kernel for improved Windows throughput.
parameters: null
Sequence Length
sequence_length
train_length: 524288
eval_length: null

Novel Contributions

  • Single-step recurrent transformer direction replacing depth-heavy UT behavior
  • Q-LoRA on q projections
  • Strict pre-normalization with residual-path RMSNorm removal
  • Universal Gradient Averaging with maturity ramp
  • Fused Triton MLP kernel for Windows RTX 3090 throughput
  • Deterministic data path for reproducible runs