PR #1096
openRecord: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342
by vimeto
val_bpb: 1.3342
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.39 MB
Training Techniques
Architecture
depth recurrence
Universal Transformer-style recurrent depth with 1 prelude, 4 shared blocks repeated for 3 loops, and 1 coda.
parameters: {"effective_layers":14,"unique_blocks":6,"model_dim":640,"heads":10,"kv_heads":5,"head_dim":64,"mlp_multiplier":3}
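A minimal sketch of the loop structure described above, with placeholder callables standing in for real transformer blocks (the block internals are not specified in this record):

```python
import numpy as np

def recurrent_forward(x, prelude, shared_blocks, coda, loops=3):
    """Universal Transformer-style depth recurrence: 1 prelude layer,
    then the shared block stack repeated `loops` times, then 1 coda.
    With 4 shared blocks and 3 loops this gives 1 + 4*3 + 1 = 14
    effective layers from 6 unique blocks."""
    x = prelude(x)
    for _ in range(loops):
        for block in shared_blocks:
            x = block(x)
    return coda(x)

# Toy usage: identity placeholders just to show the call pattern.
d = 640
layer = lambda x: x  # stands in for a real transformer block
out = recurrent_forward(np.zeros((1, d)), layer, [layer] * 4, layer)
```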
weight tying
Tied embeddings are used.
parameters: null
LeakyReLU
Uses LeakyReLU(0.5) squared as the activation.
parameters: {"slope":0.5,"squared":true}
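A literal numpy sketch of the activation; the record does not say whether the square preserves sign, so this assumes a plain elementwise square of the leaky output:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU(slope) followed by an elementwise square.
    Note: some ReLU^2 variants multiply by sign(y) to keep negative
    inputs negative; the plain square here is an assumption."""
    y = np.where(x >= 0, x, slope * x)
    return y ** 2
```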
Quantization
QAT
bits: 6
scope: shared block weights
GPTQ
bits: 6
scope: shared block weights
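A generic symmetric fake-quantization sketch of the 6-bit QAT forward pass; the record's exact scheme (and the "noisy QAT" variant mentioned under the contributions) is not specified, so this is a common straight-through-style pattern, not the PR's implementation:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Fake quantization for QAT: symmetric per-tensor quantization to
    a signed `bits`-bit grid, then dequantization back to float, so the
    forward pass sees quantization error while weights stay float."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for 6-bit signed
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```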
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"ns5":true}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"rank-1 LoRA vectors"}
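The two optimizer entries imply a parameter split: matrices go to Muon (whose `ns5` flag presumably refers to its usual 5-step Newton-Schulz orthogonalization), while the rank-1 LoRA vectors go to AdamW. A sketch of that routing; the `"lora"` name filter is illustrative, not the PR's actual naming:

```python
import numpy as np

def split_param_groups(params):
    """Route parameters to optimizers as the record describes:
    2-D weight matrices -> Muon (wd=0.04, momentum=0.99, Newton-Schulz);
    rank-1 LoRA vectors and other non-matrix params -> AdamW."""
    muon, adamw = {}, {}
    for name, p in params.items():
        if p.ndim == 2 and "lora" not in name:
            muon[name] = p
        else:
            adamw[name] = p
    return muon, adamw
```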
Regularization
weight decay
parameters: {"value":0.04}
logit softcap
parameters: {"description":"capped timestep scaling / clamped scale vectors"}
layerwise LN scale
parameters: {"description":"Output-LN / Peri-LN on shared blocks"}
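A sketch of the Peri-LN pattern named here (normalize both the input and the output of each sublayer before the residual add); learnable scale/shift parameters are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis, without learnable affine."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def peri_ln_block(x, sublayer):
    """Peri-LN / Output-LN: input-LN before the sublayer and output-LN
    after it, then the residual connection."""
    return x + layer_norm(sublayer(layer_norm(x)))
```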
Other
other
Rank-1 LoRA per-iteration adaptation applied to Q, V, MLP-up, and MLP-down matrices with unique rank-1 deltas at each loop iteration.
parameters: {"rank":1,"targets":["Q","V","MLP-up","MLP-down"]}
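A sketch of the adapted matmul and the per-iteration loop: the shared weight W gets a distinct rank-1 delta (u_t, v_t) at each loop iteration, computed without materializing the outer product. How the deltas are initialized and trained is not specified here:

```python
import numpy as np

def rank1_adapted(W, u, v, x):
    """Apply W plus a rank-1 LoRA delta: (W + u v^T) x, using
    u * (v . x) instead of forming the d x d outer product."""
    return W @ x + u * (v @ x)

def loop_with_rank1(x, W, deltas):
    """Per-iteration adaptation: each loop iteration t uses its own
    rank-1 pair (u_t, v_t) on the shared weight, as the record applies
    to the Q, V, MLP-up and MLP-down matrices."""
    for u, v in deltas:
        x = rank1_adapted(W, u, v, x)
    return x
```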
other
Birkhoff-constrained residual mixing: sigmoid-gated convex combinations keep the mixing operator's spectral norm <= 1.
parameters: null
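A sketch of the gating idea under one plausible reading: the sigmoid gate produces a convex combination of the residual stream and the branch output, so the 2x2 mixing matrix has nonnegative rows summing to 1 (i.e. lies in the Birkhoff polytope of doubly stochastic matrices) and its spectral norm is at most 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual_mix(x, branch_out, gate_logit):
    """y = a*x + (1-a)*branch_out with a = sigmoid(gate_logit) in (0,1).
    The convex weights bound the mixing operator's spectral norm by 1,
    which is the stability property the record attributes to this."""
    a = sigmoid(gate_logit)
    return a * x + (1.0 - a) * branch_out
```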
Novel Contributions
- Rank-1 LoRA per-iteration adaptation for stable recurrent transformer training
- Depth-recurrent Universal Transformer with 14 effective layers from 6 unique blocks
- Demonstration that rank-8 LoRA diverges under Muon scaling while rank-1 vectors on AdamW remain stable
- Combination of Output-LN, Birkhoff-constrained mixing, capped timestep scaling, and noisy QAT in a 640d recurrent model
- Achieves 1.3342 val_bpb with an 11.39 MB artifact