PR #2085

open

Create README for VRL Revival and Extended Muon Warmup

by umshahid
val_bpb
1.0857
Architecture
Transformer
Optimizer
Muon
Artifact Size
~16.0 MB

Training Techniques

Architecture
Value Residual
Re-introduces value residual learning by injecting the first encoder block's value tensor into later blocks via learnable per-block lambdas.
parameters: {"lambda_init":[1,0]}
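A minimal numpy sketch of the mixing step (the function name and shapes are illustrative; in the actual model this happens inside each attention block):

```python
import numpy as np

def mix_value(v_block, v_first, lambdas):
    """Value residual learning: blend this block's value tensor with the
    first block's, using learnable per-block lambdas.
    lambda_init [1, 0] makes the mix start as a pass-through."""
    lam_cur, lam_first = lambdas
    return lam_cur * v_block + lam_first * v_first
```

With the recorded init `[1, 0]` the block behaves exactly like the baseline at step 0; training can then shift weight toward the first block's values.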
depth recurrence
Reuses encoder/decoder layers according to a recurrence schedule that is activated for part of training (activated_frac).
parameters: {"encoder":[0,1,2,3,4,5,3,4],"decoder":[5,3,4,5,6,7,8,9,10],"activated_frac":0.35}
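The schedules above index into a smaller set of layer modules, so some layers run more than once per forward pass. A sketch, assuming `activated_frac` means the recurrent schedule is only switched on for part of training (that interpretation is a guess):

```python
def run_recurrent(x, layers, schedule, recur_active=True):
    """Depth recurrence: apply layer modules in schedule order, e.g.
    [0,1,2,3,4,5,3,4] runs layers 3 and 4 twice. When recurrence is
    not active, fall through each unique layer once."""
    indices = schedule if recur_active else sorted(set(schedule))
    for i in indices:
        x = layers[i](x)
    return x
```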
U-Net skip connections
Adds gated, U-Net-style skip connections from later layers.
parameters: {"from_layer":7}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"rotary_fraction":"16/64"}
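With `rotary_fraction` 16/64, only the first 16 of 64 head dimensions get rotated and the rest pass through unchanged. A numpy sketch for a single position (the half-split pairing convention and the 10000 frequency base are assumptions; implementations also differ on interleaved vs. split pairing):

```python
import numpy as np

def partial_rope(x, pos, rotary_dims=16):
    """Partial RoPE: rotate the first `rotary_dims` of the head dim
    by position-dependent angles; leave the remaining dims untouched."""
    rot, rest = x[..., :rotary_dims], x[..., rotary_dims:]
    half = rotary_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```

The rotation is norm-preserving on the rotated slice, and at position 0 it is the identity.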
LeakyReLU
Uses LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
weight tying
Ties input embeddings and output embeddings.
parameters: null
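Weight tying reduces to reusing the embedding matrix as the output head, which saves a full vocab-by-d_model parameter matrix. A trivial numpy illustration:

```python
import numpy as np

def tied_logits(hidden, embedding):
    """Tied output head: logits are the hidden states projected
    against the transpose of the input embedding matrix."""
    return hidden @ embedding.T
```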
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"muon_momentum_warmup_steps":2000}
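The extended warmup ramps Muon's momentum over the first 2000 steps. A sketch assuming a linear ramp; the 0.85 starting value is an assumption, since the PR only records the final momentum (0.99) and the warmup length:

```python
def muon_momentum(step, warmup_steps=2000, start=0.85, end=0.99):
    """Linearly ramp Muon's momentum from `start` to `end` over the
    first `warmup_steps` optimizer steps, then hold it constant."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```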
LR Schedule
warmdown
parameters: {"linear_warmdown":0.72}
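A sketch of the schedule, interpreting `linear_warmdown=0.72` as the fraction of total steps spent in the linear decay phase (that reading is an assumption):

```python
def lr_schedule(step, total_steps, base_lr, warmdown_frac=0.72):
    """Warmdown: hold base_lr constant, then decay linearly to zero
    over the final `warmdown_frac` of training."""
    decay_start = int(total_steps * (1 - warmdown_frac))
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / max(total_steps - decay_start, 1)
```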
Weight Averaging
EMA
parameters: {"decay":0.9965}
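The EMA copy of the weights moves a fraction (1 - decay) toward the live parameters every step; with decay 0.9965 that is 0.35% per step. A minimal sketch over flat parameter lists:

```python
def ema_update(avg_params, params, decay=0.9965):
    """One EMA step: blend the running average toward the live weights."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]
```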
Regularization
layerwise LN scale
parameters: null
logit softcap
parameters: {"value":30}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
Evaluation
sliding window eval
parameters: null
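Sliding-window evaluation scores a long token stream in overlapping windows so every token is predicted with substantial left context, while each token is scored exactly once. A planning sketch (window and stride values are hypothetical; the PR records no parameters):

```python
def eval_windows(n_tokens, window=1024, stride=512):
    """Plan overlapping eval windows. Each tuple is (start, end,
    score_from): tokens in [score_from, end) are newly scored, so the
    union of scored ranges covers every token exactly once."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else start + (window - stride)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans
```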
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 8
scope: token embeddings
Compression
lzma
level: null
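The final artifact is LZMA-compressed; the level is unspecified above, so the preset here is an assumption. Python's standard library covers this directly:

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    """Compress the serialized artifact with LZMA (preset 9 assumed)."""
    return lzma.compress(blob, preset=9)
```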

Novel Contributions

  • Re-introduced Value Residual Learning (VRL) on top of the bigbag PR #1493 stack
  • Extended Muon momentum warmup from 1500 to 2000 steps
  • Reported consistent BPB improvement from VRL and a smaller additional gain from longer momentum warmup
  • Documented reproduction steps and seed-wise results, including a near-limit artifact size case