PR #2085

open

Create README for VRL Revival and Extended Muon Warmup

by umshahid
val_bpb
1.0857
Architecture
Transformer
Optimizer
Muon
Artifact Size
~16.0 MB

Training Techniques

Architecture
Value Residual
Re-introduces value residual learning by injecting the first encoder block's value tensor into later blocks via learnable per-block lambdas.
parameters: {"lambda_init":[1,0]}
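A minimal numpy sketch of the mixing step (the function name and shapes are illustrative; in the actual model this happens inside each attention block):

```python
import numpy as np

def mix_value(v_block, v_first, lambdas):
    """Value residual learning: blend this block's value tensor with the
    first block's, using learnable per-block lambdas.
    lambda_init [1, 0] makes the mix start as a pass-through."""
    lam_cur, lam_first = lambdas
    return lam_cur * v_block + lam_first * v_first
```

With the recorded init `[1, 0]` the block behaves exactly like the baseline at step 0; training can then shift weight toward the first block's values.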
depth recurrence
Reuses encoder/decoder layers according to a recurrence schedule that is activated for part of training (activated_frac).
parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10],"activated_frac":0.35}
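The schedules above index into a smaller set of layer modules, so some layers run more than once per forward pass. A sketch, assuming `activated_frac` means the recurrent schedule is only switched on for part of training (that interpretation is a guess):

```python
def run_recurrent(x, layers, schedule, recur_active=True):
    """Depth recurrence: apply layer modules in schedule order, e.g.
    [0,1,2,3,4,5,3,4] runs layers 3 and 4 twice. When recurrence is
    not active, fall through each unique layer once."""
    indices = schedule if recur_active else sorted(set(schedule))
    for i in indices:
        x = layers[i](x)
    return x
```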
U-Net skip connections
Adds gated, U-Net-style skip connections from later layers.
parameters: {"from_layer":7}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"rotary_fraction":"16/64"}
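With `rotary_fraction` 16/64, only the first 16 of 64 head dimensions get rotated and the rest pass through unchanged. A numpy sketch for a single position (the half-split pairing convention and the 10000 frequency base are assumptions; implementations also differ on interleaved vs. split pairing):

```python
import numpy as np

def partial_rope(x, pos, rotary_dims=16):
    """Partial RoPE: rotate the first `rotary_dims` of the head dim
    by position-dependent angles; leave the remaining dims untouched."""
    rot, rest = x[..., :rotary_dims], x[..., rotary_dims:]
    half = rotary_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```

The rotation is norm-preserving on the rotated slice, and at position 0 it is the identity.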
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
weight tying
Ties input embeddings and output embeddings.
parameters: null
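Weight tying reduces to reusing the embedding matrix as the output head, which saves a full vocab-by-d_model parameter matrix. A trivial numpy illustration:

```python
import numpy as np

def tied_logits(hidden, embedding):
    """Tied output head: logits are the hidden states projected
    against the transpose of the input embedding matrix."""
    return hidden @ embedding.T
```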
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"muon_momentum_warmup_steps":2000}
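The extended warmup ramps Muon's momentum over the first 2000 steps. A sketch assuming a linear ramp; the 0.85 starting value is an assumption, since the PR only records the final momentum (0.99) and the warmup length:

```python
def muon_momentum(step, warmup_steps=2000, start=0.85, end=0.99):
    """Linearly ramp Muon's momentum from `start` to `end` over the
    first `warmup_steps` optimizer steps, then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```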
LR Schedule
warmdown
parameters: {"linear_warmdown":0.72}
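A sketch of the schedule, interpreting `linear_warmdown=0.72` as the fraction of total steps spent in the linear decay phase (that reading is an assumption):

```python
def lr_schedule(step, total_steps, base_lr, warmdown_frac=0.72):
    """Warmdown: hold base_lr constant, then decay linearly to zero
    over the final `warmdown_frac` of training."""
    decay_start = int(total_steps * (1 - warmdown_frac))
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / max(total_steps - decay_start, 1)
```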
Weight Averaging
EMA
parameters: {"decay":0.9965}
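The EMA copy of the weights moves a fraction (1 - decay) toward the live parameters every step; with decay 0.9965 that is 0.35% per step. A minimal sketch over flat parameter lists:

```python
def ema_update(avg_params, params, decay=0.9965):
    """One EMA step: blend the running average toward the live weights."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]
```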
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
Evaluation
sliding window eval
parameters: null
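Sliding-window evaluation scores a long token stream in overlapping windows so every token is predicted with substantial left context, while each token is scored exactly once. A planning sketch (window and stride values are hypothetical; the PR records no parameters):

```python
def eval_windows(n_tokens, window=1024, stride=512):
    """Plan overlapping eval windows. Each tuple is (start, end,
    score_from): tokens in [score_from, end) are newly scored, so the
    union of scored ranges covers every token exactly once."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else start + (window - stride)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans
```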
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: token embeddings
Compression
lzma
level: null
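The final artifact is LZMA-compressed; the level is unspecified above, so the preset here is an assumption. Python's standard library covers this directly:

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    """Compress the serialized artifact with LZMA (preset 9 assumed)."""
    return lzma.compress(blob, preset=9)
```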

Novel Contributions

  • Re-introduced Value Residual Learning (VRL) on top of the bigbag PR #1493 stack
  • Extended Muon momentum warmup from 1500 to 2000 steps
  • Reported consistent BPB improvement from VRL and a smaller additional gain from longer momentum warmup
  • Documented reproduction steps and seed-wise results, including a near-limit artifact size case