PR #1300
(open) Non-record: 1.8184 BPB Single-step Recurrent Transformer with Q-LoRA (Windows 3090)
by Ribin545
val_bpb
1.8184
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
weight tying
Single recurrent block reused across depth with tied weights.
parameters: {"steps":1}
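A minimal sketch of what the tied recurrent block could look like: one shared block applied `steps` times, so depth reuses the same weights. The `steps:1` setting from this profile makes it a single pass; the function and argument names are illustrative, not from the PR's code.

```python
def recurrent_forward(x, block, steps=1):
    # One shared block reused across depth: tied weights, `steps` passes.
    # With steps=1 (this PR's profile) this degenerates to a single pass.
    for _ in range(steps):
        x = block(x)
    return x
```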
BigramHash
Enabled bigram hash feature path in the model.
parameters: {"size":2048,"scale":0.05}
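A hedged sketch of a bigram hash feature path under the listed parameters (table size 2048, scale 0.05): each (previous, current) token pair is hashed into a bucket and its embedding is added as a small scaled feature. The hashing scheme and the padding token for position 0 are assumptions.

```python
import numpy as np

def bigram_hash_features(tokens, table, scale=0.05, size=2048):
    # Hash each (prev, cur) token pair into one of `size` buckets and
    # look up its embedding, scaled down to act as a small additive feature.
    feats = np.zeros((len(tokens), table.shape[1]))
    prev = 0  # hypothetical padding token for the first position
    for i, cur in enumerate(tokens):
        bucket = hash((prev, cur)) % size
        feats[i] = scale * table[bucket]
        prev = cur
    return feats
```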
GQA
Uses grouped query attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
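A minimal sketch of grouped query attention with the listed head counts (8 query heads, 4 KV heads): each KV head serves a group of query heads, halving the KV cache. Shapes and names are illustrative.

```python
import numpy as np

def gqa(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (num_heads, T, d); k, v: (num_kv_heads, T, d).
    # Each KV head is shared by num_heads // num_kv_heads query heads.
    group = num_heads // num_kv_heads
    out = np.empty_like(q)
    d = q.shape[-1]
    for h in range(num_heads):
        kh, vh = k[h // group], v[h // group]
        scores = q[h] @ kh.T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ vh
    return out
```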
LoRA
Per-step LoRA adapters on the recurrent block, scoped to q projections.
parameters: {"rank":512,"scope":"q"}
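One way the per-step, q-only adapter routing could be sketched: the tied block's q projection gets a distinct low-rank (A, B) pair per recurrence step, so each step can specialize despite shared base weights. The rank (512) and scope ("q") are from the listed parameters; everything else here is a hypothetical shape.

```python
import numpy as np

def step_q_proj(x, W_q, adapters, step, alpha=1.0):
    # Base q projection plus a per-step low-rank LoRA update.
    # adapters[step] = (A, B) with A: (d_in, rank), B: (rank, d_out);
    # only q is adapted, matching the PR's "scope": "q".
    A, B = adapters[step]
    return x @ W_q + alpha * (x @ A) @ B
```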
Regularization
logit softcap
parameters: null
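Logit softcapping is commonly implemented as a scaled tanh that smoothly bounds logits; the PR records no parameter (`parameters: null`), so the cap value below is an assumption for illustration.

```python
import math

def softcap(logit, cap=30.0):
    # Smoothly bounds the logit to (-cap, cap); near-identity for
    # small logits. The cap value 30.0 is an assumption, not from the PR.
    return cap * math.tanh(logit / cap)
```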
Weight Averaging
EMA
parameters: null
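EMA weight averaging in its usual form: keep an exponential moving average of the parameters and evaluate with the averaged copy. The decay value is an assumption, since the PR lists `parameters: null`.

```python
def ema_update(ema_params, params, decay=0.999):
    # Exponential moving average of weights; the EMA copy (not the raw
    # weights) is what gets evaluated. Decay 0.999 is illustrative.
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]
```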
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"dense_matrices":true}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":["scalar/control","LoRA","embeddings"]}
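The split above (Muon for dense matrices, AdamW for scalar/control params, LoRA adapters, and embeddings) can be sketched as a simple routing pass over named parameters; the name-matching rules below are an assumed convention, not the PR's actual code.

```python
def route_params(named_params):
    # Muon gets dense 2-D weight matrices; AdamW gets everything else:
    # scalars/control params, LoRA adapters, and embeddings, per the PR.
    muon, adamw = [], []
    for name, shape in named_params:
        is_dense_matrix = (len(shape) == 2
                           and "embed" not in name
                           and "lora" not in name)
        (muon if is_dense_matrix else adamw).append(name)
    return muon, adamw
```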
Evaluation
stride-based eval
parameters: {"stride":64}
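Stride-based evaluation typically slides a full context window forward `stride` tokens at a time and scores only the last `stride` positions of each window, so every token is predicted with near-maximal left context. A sketch under that assumption, with stride 64 from the listed parameters:

```python
def stride_eval_spans(n_tokens, context=1024, stride=64):
    # Returns (ctx_start, scored_start, scored_end) triples: each span
    # scores `stride` tokens, conditioning on up to `context` tokens of
    # left context. Scored spans tile the sequence without overlap.
    spans = []
    for scored_start in range(0, n_tokens, stride):
        scored_end = min(scored_start + stride, n_tokens)
        ctx_start = max(0, scored_end - context)
        spans.append((ctx_start, scored_start, scored_end))
    return spans
```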
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":12}
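A minimal sketch of the warmup schedule with the listed `warmup_steps: 12`: linear ramp to the base learning rate. The PR does not record what happens after warmup, so this sketch simply holds the base rate.

```python
def lr_with_warmup(step, base_lr, warmup_steps=12):
    # Linear ramp from ~0 to base_lr over the first `warmup_steps` steps.
    # The post-warmup schedule is unspecified in the PR; held flat here.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```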
Initialization
OrthoInit
Orthogonal initialization is available as a toggle but disabled in the current profile.
Novel Contributions
- Single-step recurrent Transformer regime that replaced deeper UT-style recurrence under a 10-minute wallclock constraint.
- Deterministic Windows RTX 3090 reproducible setup with fixed data order and seed.
- Per-step LoRA adapter routing with q-only scope in a tied recurrent block.
- BigramHash feature path and architecture-aware optimizer routing.
- Documented BPB progression from UT baseline to a 1.8184 BPB reproducible checkpoint.