PR #1096
openRecord: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342
by vimeto
val_bpb: 1.3342
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.39 MB
Training Techniques
Architecture
depth recurrence
Universal Transformer-style recurrent depth with 1 prelude, 4 shared blocks repeated for 3 loops, and 1 coda.
parameters: {"effective_layers":14,"unique_blocks":6,"model_dim":640,"heads":10,"kv_heads":5,"head_dim":64,"mlp_multiplier":3}
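A minimal sketch of the loop structure described above, with placeholder callables standing in for real transformer blocks (the block internals are not specified in this record):

```python
import numpy as np

def recurrent_forward(x, prelude, shared_blocks, coda, loops=3):
    """Universal Transformer-style depth recurrence: 1 prelude layer,
    then the shared block stack repeated `loops` times, then 1 coda.
    With 4 shared blocks and 3 loops this gives 1 + 4*3 + 1 = 14
    effective layers from 6 unique blocks."""
    x = prelude(x)
    for _ in range(loops):
        for block in shared_blocks:
            x = block(x)
    return coda(x)

# Toy usage: identity placeholders just to show the call pattern.
d = 640
layer = lambda x: x  # stands in for a real transformer block
out = recurrent_forward(np.zeros((1, d)), layer, [layer] * 4, layer)
```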
weight tying
Tied embeddings are used.
parameters: null
LeakyReLU
Uses LeakyReLU(0.5) squared as the activation.
parameters: {"slope":0.5,"squared":true}
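A literal numpy sketch of the activation; the record does not say whether the square preserves sign, so this assumes a plain elementwise square of the leaky output:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU(slope) followed by an elementwise square.
    Note: some ReLU^2 variants multiply by sign(y) to keep negative
    inputs negative; the plain square here is an assumption."""
    y = np.where(x >= 0, x, slope * x)
    return y ** 2
```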
Quantization
QAT
bits: 6
scope: shared block weights
GPTQ
bits: 6
scope: shared block weights
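A generic symmetric fake-quantization sketch of the 6-bit QAT forward pass; the record's exact scheme (and the "noisy QAT" variant mentioned under the contributions) is not specified, so this is a common straight-through-style pattern, not the PR's implementation:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Fake quantization for QAT: symmetric per-tensor quantization to
    a signed `bits`-bit grid, then dequantization back to float, so the
    forward pass sees quantization error while weights stay float."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for 6-bit signed
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```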
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"ns5":true}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"rank-1 LoRA vectors"}
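The two optimizer entries imply a parameter split: matrices go to Muon (whose `ns5` flag presumably refers to its usual 5-step Newton-Schulz orthogonalization), while the rank-1 LoRA vectors go to AdamW. A sketch of that routing; the `"lora"` name filter is illustrative, not the PR's actual naming:

```python
import numpy as np

def split_param_groups(params):
    """Route parameters to optimizers as the record describes:
    2-D weight matrices -> Muon (wd=0.04, momentum=0.99, Newton-Schulz);
    rank-1 LoRA vectors and other non-matrix params -> AdamW."""
    muon, adamw = {}, {}
    for name, p in params.items():
        if p.ndim == 2 and "lora" not in name:
            muon[name] = p
        else:
            adamw[name] = p
    return muon, adamw
```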
Regularization
weight decay
parameters: {"value":0.04}
logit softcap
parameters: {"description":"capped timestep scaling / clamped scale vectors"}
layerwise LN scale
parameters: {"description":"Output-LN / Peri-LN on shared blocks"}
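A sketch of the Peri-LN pattern named here (normalize both the input and the output of each sublayer before the residual add); learnable scale/shift parameters are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis, without learnable affine."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def peri_ln_block(x, sublayer):
    """Peri-LN / Output-LN: input-LN before the sublayer and output-LN
    after it, then the residual connection."""
    return x + layer_norm(sublayer(layer_norm(x)))
```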
Other
other
Rank-1 LoRA per-iteration adaptation applied to Q, V, MLP-up, and MLP-down matrices with unique rank-1 deltas at each loop iteration.
parameters: {"rank":1,"targets":["Q","V","MLP-up","MLP-down"]}
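A sketch of the adapted matmul and the per-iteration loop: the shared weight W gets a distinct rank-1 delta (u_t, v_t) at each loop iteration, computed without materializing the outer product. How the deltas are initialized and trained is not specified here:

```python
import numpy as np

def rank1_adapted(W, u, v, x):
    """Apply W plus a rank-1 LoRA delta: (W + u v^T) x, using
    u * (v . x) instead of forming the d x d outer product."""
    return W @ x + u * (v @ x)

def loop_with_rank1(x, W, deltas):
    """Per-iteration adaptation: each loop iteration t uses its own
    rank-1 pair (u_t, v_t) on the shared weight, as the record applies
    to the Q, V, MLP-up and MLP-down matrices."""
    for u, v in deltas:
        x = rank1_adapted(W, u, v, x)
    return x
```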
other
Birkhoff-constrained residual mixing: sigmoid-gated convex combinations keep the mixing operator's spectral norm <= 1.
parameters: null
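A sketch of the gating idea under one plausible reading: the sigmoid gate produces a convex combination of the residual stream and the branch output, so the 2x2 mixing matrix has nonnegative rows summing to 1 (i.e. lies in the Birkhoff polytope of doubly stochastic matrices) and its spectral norm is at most 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual_mix(x, branch_out, gate_logit):
    """y = a*x + (1-a)*branch_out with a = sigmoid(gate_logit) in (0,1).
    The convex weights bound the mixing operator's spectral norm by 1,
    which is the stability property the record attributes to this."""
    a = sigmoid(gate_logit)
    return a * x + (1.0 - a) * branch_out
```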
Novel Contributions
- Rank-1 LoRA per-iteration adaptation for stable recurrent transformer training
- Depth-recurrent Universal Transformer with 14 effective layers from 6 unique blocks
- Demonstration that rank-8 LoRA diverges under Muon scaling while rank-1 vectors on AdamW remain stable
- Combination of Output-LN, Birkhoff-constrained mixing, capped timestep scaling, and noisy QAT in a 640d recurrent model
- Achieves 1.3342 val_bpb with an 11.39 MB artifact