PR #1300
(open) Non-record: 1.8184 BPB Single-step Recurrent Transformer with Q-LoRA (Windows 3090)
by Ribin545
val_bpb
1.8184
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
weight tying
Single recurrent block reused across depth with tied weights.
parameters: {"steps":1}
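A minimal sketch of what the tied recurrent block could look like: one shared block applied `steps` times, so depth reuses the same weights. The `steps:1` setting from this profile makes it a single pass; the function and argument names are illustrative, not from the PR's code.

```python
def recurrent_forward(x, block, steps=1):
    # One shared block reused across depth: tied weights, `steps` passes.
    # With steps=1 (this PR's profile) this degenerates to a single pass.
    for _ in range(steps):
        x = block(x)
    return x
```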
BigramHash
Enabled bigram hash feature path in the model.
parameters: {"size":2048,"scale":0.05}
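A hedged sketch of a bigram hash feature path under the listed parameters (table size 2048, scale 0.05): each (previous, current) token pair is hashed into a bucket and its embedding is added as a small scaled feature. The hashing scheme and the padding token for position 0 are assumptions.

```python
import numpy as np

def bigram_hash_features(tokens, table, scale=0.05, size=2048):
    # Hash each (prev, cur) token pair into one of `size` buckets and
    # look up its embedding, scaled down to act as a small additive feature.
    feats = np.zeros((len(tokens), table.shape[1]))
    prev = 0  # hypothetical padding token for the first position
    for i, cur in enumerate(tokens):
        bucket = hash((prev, cur)) % size
        feats[i] = scale * table[bucket]
        prev = cur
    return feats
```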
GQA
Uses grouped query attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4}
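A minimal sketch of grouped query attention with the listed head counts (8 query heads, 4 KV heads): each KV head serves a group of query heads, halving the KV cache. Shapes and names are illustrative.

```python
import numpy as np

def gqa(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (num_heads, T, d); k, v: (num_kv_heads, T, d).
    # Each KV head is shared by num_heads // num_kv_heads query heads.
    group = num_heads // num_kv_heads
    out = np.empty_like(q)
    d = q.shape[-1]
    for h in range(num_heads):
        kh, vh = k[h // group], v[h // group]
        scores = q[h] @ kh.T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ vh
    return out
```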
LoRA
Per-step LoRA adapters on the recurrent block, scoped to q projections.
parameters: {"rank":512,"scope":"q"}
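One way the per-step, q-only adapter routing could be sketched: the tied block's q projection gets a distinct low-rank (A, B) pair per recurrence step, so each step can specialize despite shared base weights. The rank (512) and scope ("q") are from the listed parameters; everything else here is a hypothetical shape.

```python
import numpy as np

def step_q_proj(x, W_q, adapters, step, alpha=1.0):
    # Base q projection plus a per-step low-rank LoRA update.
    # adapters[step] = (A, B) with A: (d_in, rank), B: (rank, d_out);
    # only q is adapted, matching the PR's "scope": "q".
    A, B = adapters[step]
    return x @ W_q + alpha * (x @ A) @ B
```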
Regularization
logit softcap
parameters: null
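Logit softcapping is commonly implemented as a scaled tanh that smoothly bounds logits; the PR records no parameter (`parameters: null`), so the cap value below is an assumption for illustration.

```python
import math

def softcap(logit, cap=30.0):
    # Smoothly bounds the logit to (-cap, cap); near-identity for
    # small logits. The cap value 30.0 is an assumption, not from the PR.
    return cap * math.tanh(logit / cap)
```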
Weight Averaging
EMA
parameters: null
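EMA weight averaging in its usual form: keep an exponential moving average of the parameters and evaluate with the averaged copy. The decay value is an assumption, since the PR lists `parameters: null`.

```python
def ema_update(ema_params, params, decay=0.999):
    # Exponential moving average of weights; the EMA copy (not the raw
    # weights) is what gets evaluated. Decay 0.999 is illustrative.
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]
```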
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"dense_matrices":true}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":["scalar/control","LoRA","embeddings"]}
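The split above (Muon for dense matrices, AdamW for scalar/control params, LoRA adapters, and embeddings) can be sketched as a simple routing pass over named parameters; the name-matching rules below are an assumed convention, not the PR's actual code.

```python
def route_params(named_params):
    # Muon gets dense 2-D weight matrices; AdamW gets everything else:
    # scalars/control params, LoRA adapters, and embeddings, per the PR.
    muon, adamw = [], []
    for name, shape in named_params:
        is_dense_matrix = (len(shape) == 2
                           and "embed" not in name
                           and "lora" not in name)
        (muon if is_dense_matrix else adamw).append(name)
    return muon, adamw
```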
Evaluation
stride-based eval
parameters: {"stride":64}
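Stride-based evaluation typically slides a full context window forward `stride` tokens at a time and scores only the last `stride` positions of each window, so every token is predicted with near-maximal left context. A sketch under that assumption, with stride 64 from the listed parameters:

```python
def stride_eval_spans(n_tokens, context=1024, stride=64):
    # Returns (ctx_start, scored_start, scored_end) triples: each span
    # scores `stride` tokens, conditioning on up to `context` tokens of
    # left context. Scored spans tile the sequence without overlap.
    spans = []
    for scored_start in range(0, n_tokens, stride):
        scored_end = min(scored_start + stride, n_tokens)
        ctx_start = max(0, scored_end - context)
        spans.append((ctx_start, scored_start, scored_end))
    return spans
```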
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":12}
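A minimal sketch of the warmup schedule with the listed `warmup_steps: 12`: linear ramp to the base learning rate. The PR does not record what happens after warmup, so this sketch simply holds the base rate.

```python
def lr_with_warmup(step, base_lr, warmup_steps=12):
    # Linear ramp from ~0 to base_lr over the first `warmup_steps` steps.
    # The post-warmup schedule is unspecified in the PR; held flat here.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```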
Initialization
OrthoInit
Orthogonal initialization is available as a toggle but disabled in the current profile.
Novel Contributions
- Single-step recurrent Transformer regime that replaced deeper UT-style recurrence under a 10-minute wallclock constraint.
- Deterministic Windows RTX 3090 reproducible setup with fixed data order and seed.
- Per-step LoRA adapter routing with q-only scope in a tied recurrent block.
- BigramHash feature path and architecture-aware optimizer routing.
- Documented BPB progression from UT baseline to a 1.8184 BPB reproducible checkpoint.