PR #2114
open
Non-record: UT7 delta-residual + RLMA-rank320 — documented failed direction (val_bpb 1.29740, 2-seed 1xH100)
by Sacmaj
val_bpb
1.2974
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,842,747 bytes
Training Techniques
Architecture
depth recurrence
Universal Transformer with a shared recurrent block unrolled for K iterations, plus unique input and output blocks.
parameters: {"K_iters":6}
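The unrolling described above can be sketched as follows. This is a minimal illustration, assuming the blocks are plain functions over a list of floats; the names `universal_transformer_forward`, `identity`, and `add_one` are hypothetical and do not come from the PR.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def universal_transformer_forward(
    x: Vector,
    input_block: Block,
    shared_block: Block,
    output_block: Block,
    k_iters: int = 6,
) -> Vector:
    """Apply a unique input block, the same shared block k_iters times,
    then a unique output block."""
    h = input_block(x)
    for _ in range(k_iters):
        # The same parameters (here: the same function) are reused each iteration.
        h = shared_block(h)
    return output_block(h)


# Toy stand-ins for real transformer blocks (hypothetical).
def identity(v: Vector) -> Vector:
    return list(v)


def add_one(v: Vector) -> Vector:
    return [a + 1.0 for a in v]


result = universal_transformer_forward([0.0], identity, add_one, identity, k_iters=6)
```

With K_iters = 6 the shared block contributes six layers of compute while its weights are stored only once.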
weight tying
Tied embeddings are used.
parameters: null
GQA
Grouped query attention with fewer KV heads than query heads.
parameters: {"num_heads":8,"num_kv_heads":4,"head_dim":128}
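As a rough sketch of what these parameters imply, the snippet below maps each query head to the KV head it shares and computes the projection widths. It assumes consecutive query heads are grouped together, which is a common convention but is not confirmed by the PR; `kv_head_for_query_head` is a hypothetical helper.

```python
NUM_HEADS = 8
NUM_KV_HEADS = 4
HEAD_DIM = 128


def kv_head_for_query_head(q_head: int, num_heads: int = NUM_HEADS,
                           num_kv_heads: int = NUM_KV_HEADS) -> int:
    """Return the index of the KV head shared by the given query head."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


mapping = [kv_head_for_query_head(q) for q in range(NUM_HEADS)]

# Projection widths implied by the parameters above.
q_width = NUM_HEADS * HEAD_DIM      # query projection output width
kv_width = NUM_KV_HEADS * HEAD_DIM  # key (and value) projection output width
```

Here two query heads share each KV head, so the key and value projections are half the width of the query projection.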
MLP3x
Feed-forward dimension is 3x the model dimension.
parameters: {"model_dim":1024,"d_ff":3072}
RLMA
Low-rank adapter matrices over deterministic random bases, regenerated per layer/iteration.
parameters: {"adapter_rank":320}
residual delta
The shared block uses an additive delta-residual update instead of state contraction.
parameters: {"branch_scale_init":0.6}
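A minimal sketch of an additive delta update, assuming the block output is scaled by a scalar branch scale initialised to 0.6 and added to the incoming state. The exact form used in the PR, and the state-contraction variant it replaced, are not specified here; `delta_residual_step` and `toy_block` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def delta_residual_step(h: Vector, block: Callable[[Vector], Vector],
                        branch_scale: float = 0.6) -> Vector:
    """One recurrent iteration: h <- h + branch_scale * block(h)."""
    delta = block(h)
    return [hi + branch_scale * di for hi, di in zip(h, delta)]


# Toy block that proposes a constant update of 1.0 per coordinate (hypothetical).
def toy_block(v: Vector) -> Vector:
    return [1.0 for _ in v]


h_next = delta_residual_step([1.0, 2.0], toy_block)
```

Because the previous state passes through unchanged, each iteration only has to learn a correction to it.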
Quantization
GPTQ
bits: 8
scope: adapter matrices and embeddings
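To illustrate what 8-bit storage with a clipping threshold means, here is a simplified symmetric round-to-nearest quantizer. It is not GPTQ: GPTQ additionally compensates quantization error using second-order information, which this sketch omits. The function names and the `clip` parameter are assumptions for illustration.

```python
from typing import List, Tuple


def quantize_int8(values: List[float], clip: float) -> Tuple[List[int], float]:
    """Symmetric int8 quantization: values beyond +/-clip saturate at +/-127."""
    if clip <= 0:
        raise ValueError("clip must be positive")
    scale = clip / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [qi * scale for qi in q]


codes, scale = quantize_int8([0.0, 1.0, -1.0, 5.0], clip=1.0)
restored = dequantize(codes, scale)
```

A smaller clip gives finer resolution for typical weights but saturates more outliers.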
Compression
zstd
level: 22
Test-Time Training
TTT
parameters: {"rank":288}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
LR Schedule
warmdown
parameters: {"warmdown_iters":600}
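One way to picture a warmdown that always engages is to key the decay to the step at which training will actually stop. The sketch below assumes a linear decay to zero over the final 600 iterations and two stopping criteria (an iteration cap and a time budget expressed in steps); both helper names are hypothetical and the PR's actual fix may differ.

```python
def effective_total_steps(iteration_cap: int, steps_within_time_budget: int) -> int:
    """Training stops at whichever limit binds first."""
    return min(iteration_cap, steps_within_time_budget)


def lr_multiplier(step: int, total_steps: int, warmdown_iters: int = 600) -> float:
    """Return 1.0 until the warmdown window, then decay linearly to 0.0."""
    if total_steps <= 0:
        return 0.0
    warmdown_start = max(total_steps - warmdown_iters, 0)
    if step < warmdown_start:
        return 1.0
    span = total_steps - warmdown_start  # equals warmdown_iters unless the run is shorter
    remaining = max(total_steps - step, 0)
    return remaining / span


# Example: the iteration cap (2000) binds before the time budget (5000 steps).
total = effective_total_steps(2000, 5000)
midway = lr_multiplier(1700, total)
```

Computing the window from the effective stopping step means the decay starts on time regardless of which limit ends the run.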
Regularization
logit softcap
parameters: {"value":30}
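A logit softcap is commonly implemented as a scaled tanh; the sketch below assumes that form with the value 30 from the parameters above. Whether the PR uses exactly this function is not stated, and `softcap` is a hypothetical name.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


small = softcap(1.0)     # nearly unchanged for |logit| much smaller than cap
large = softcap(1000.0)  # saturates near the cap
```

Small logits pass through almost linearly, while extreme logits are squashed toward plus or minus 30.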
gradient clipping
parameters: {"norm":0.2}
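Clipping to a norm of 0.2 can be sketched as global-norm clipping over a flat list of gradient values. This assumes the clip is applied to the global L2 norm, the usual convention, which the PR does not confirm; `clip_grad_norm` here is a hypothetical stdlib-only helper, not a library call.

```python
import math
from typing import List, Tuple


def clip_grad_norm(grads: List[float], max_norm: float = 0.2,
                   eps: float = 1e-6) -> Tuple[List[float], float]:
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, original_norm = clip_grad_norm([3.0, 4.0])
clipped_norm = math.sqrt(sum(g * g for g in clipped))
```

Gradients whose norm is already below the threshold are returned unchanged.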
Other
other
Deterministic random-base adapters regenerate a random projection from a hash of layer name and iteration index, reducing stored parameters.
parameters: null
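The regeneration idea can be sketched with the standard library: derive a seed from a hash of the layer name and iteration index, then rebuild the same random matrix on demand instead of storing it. The hash function (SHA-256), the Gaussian entries, the layer name, and both helper names are assumptions for illustration; the PR's actual scheme is not shown here.

```python
import hashlib
import random
from typing import List


def base_seed(layer_name: str, iteration: int) -> int:
    """Derive a reproducible 64-bit seed from the layer name and iteration index."""
    digest = hashlib.sha256(f"{layer_name}:{iteration}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def random_base(layer_name: str, iteration: int, rows: int, cols: int) -> List[List[float]]:
    """Regenerate the random projection for this layer and iteration."""
    rng = random.Random(base_seed(layer_name, iteration))
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# "shared.attn.q" is a made-up layer name.
first = random_base("shared.attn.q", 0, rows=2, cols=3)
again = random_base("shared.attn.q", 0, rows=2, cols=3)
other = random_base("shared.attn.q", 1, rows=2, cols=3)
```

Since the base is a pure function of the name and index, only the trained low-rank adapter matrices need to be written to the artifact.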
Novel Contributions
- Documented failed UT + RLMA direction with reusable negative results rather than a competitive record submission
- Delta-residual update on the shared Universal Transformer block improved optimization versus recurrent state contraction
- Direct ablation showed in-model TTT was slower and worse at this scale, so it was removed
- Warmdown schedule bug was diagnosed and fixed so that LR decay engages even when the iteration cap binds first
- Cap-hardening sweep selected a GPTQ clipping threshold that balanced artifact size and validation performance