PR #1445

Status: open

[Record] 3-Layer Depth Recurrence + EMA 0.9965 + WD 0.095 — val_bpb 1.0889

by X-Abhishek-X
val_bpb: 1.0889
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.89 MB

Training Techniques

Architecture
depth recurrence
Repeats layers 3, 4, and 5 as a 3-layer recurrence, creating 14 virtual layers from 11 physical layers.
parameters: {"layers":[3,4,5],"virtual_layers":14,"physical_layers":11,"start_step":2000}
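The recurrence above can be sketched as a plain forward loop; a minimal sketch, assuming the model's layers are a list of callables (function and argument names here are hypothetical). With 11 physical layers and the 3-layer block at indices 3–5 applied twice, the effective depth is 11 + 3 = 14 virtual layers, matching the parameters above.

```python
def forward_with_recurrence(x, layers, recur_indices=(3, 4, 5), repeats=2):
    """Apply `layers` in order, looping the recurrent block `repeats` times.

    `repeats=2` means the block runs twice in total, so 11 physical layers
    yield 14 virtual layer applications.
    """
    first, last = recur_indices[0], recur_indices[-1]
    for i, layer in enumerate(layers):
        if i == first:
            for _ in range(repeats):
                for sub in layers[first:last + 1]:
                    x = sub(x)
        elif i in recur_indices:
            continue  # already applied inside the recurrence loop
        else:
            x = layer(x)
    return x
```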
weight tying
Tied input and output embeddings.
parameters: null
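A minimal NumPy sketch of what weight tying means here: the output head reuses the input embedding matrix, so no separate unembedding weight is stored. The variable names and toy dimensions are illustrative.

```python
import numpy as np

vocab_size, d_model = 10, 4

# Shared embedding table: used both to embed input tokens and, transposed,
# as the output projection (the "tied" weights).
E = np.arange(vocab_size * d_model, dtype=float).reshape(vocab_size, d_model)

h = np.ones(d_model)   # final hidden state for one position
logits = h @ E.T       # tied output head: logits over the vocabulary
```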
GQA
Uses grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
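The KV-sharing step of grouped query attention with these settings can be sketched as follows: with 8 query heads and 4 KV heads, each KV head is duplicated so that two query heads attend over the same keys and values. The tensor layout and function name are illustrative assumptions.

```python
import numpy as np

def repeat_kv(kv, n_heads=8, n_kv_heads=4):
    """kv: (n_kv_heads, seq, head_dim) -> (n_heads, seq, head_dim).

    Each KV head is repeated group_size times so consecutive query heads
    share the same keys/values.
    """
    group_size = n_heads // n_kv_heads
    return np.repeat(kv, group_size, axis=0)
```

In practice this repetition is usually fused into the attention kernel rather than materialized, which is where the KV-cache memory savings come from.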
VE128
Shared Value Embedding used in layers 9 and 10.
parameters: {"dimensions":128,"layers":[9,10]}
U-Net skip connections
Skip gates and parallel residual connections from layer 7.
parameters: {"from_layer":7}
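A minimal sketch of a gated U-Net-style skip: the activation leaving layer 7 is cached and added back through a gate before a later layer. Only `from_layer=7` comes from the parameters above; the destination layer and gate value are illustrative assumptions (the gate would normally be a learned scalar).

```python
def forward_with_skip(x, layers, from_layer=7, to_layer=10, gate=0.5):
    cached = None
    for i, layer in enumerate(layers):
        if i == to_layer and cached is not None:
            x = x + gate * cached   # gated skip connection
        x = layer(x)
        if i == from_layer:
            cached = x              # activation flowing out of layer 7
    return x
```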
Weight Averaging
EMA
parameters: {"decay":0.9965}
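The EMA update with this decay can be sketched in a few lines; a minimal sketch where parameters are plain floats (in a real model they would be tensors, and the EMA copy is the one used for evaluation).

```python
def ema_update(ema_params, params, decay=0.9965):
    """In-place EMA of model weights: ema <- decay * ema + (1 - decay) * current."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```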
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"lr":0.022,"backend_steps":5}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.008,"fused":true,"role":"head"}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"lr":0.6,"fused":true,"role":"embeddings"}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"lr":0.02,"fused":true,"role":"scalars"}
Regularization
weight decay
parameters: {"value":0.095}
logit softcap
parameters: {"value":30}
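Logit soft-capping at 30 squashes logits smoothly into (-30, 30) via tanh, leaving small logits almost unchanged while bounding outliers; a minimal sketch:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap); near-identity for |logit| << cap."""
    return cap * math.tanh(logit / cap)
```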
Evaluation
sliding window eval
parameters: {"stride":64}
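Sliding-window evaluation with stride 64 scores the document in overlapping windows: the first window scores all its tokens, and each subsequent window scores only its final `stride` tokens, so later tokens get near-maximal left context. A minimal sketch of the window bookkeeping; the function name is hypothetical, the window size is illustrative, and the sketch assumes `seq_len - window` is a multiple of `stride`.

```python
def sliding_window_positions(seq_len, window=2048, stride=64):
    """Return (start, end, first_scored) triples covering [0, seq_len)."""
    positions = []
    start = 0
    while True:
        end = min(start + window, seq_len)
        first_scored = start if start == 0 else end - stride
        positions.append((start, end, first_scored))
        if end >= seq_len:
            break
        start += stride
    return positions
```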
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
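With `warmdown_frac` of 0.72, the learning rate is held constant for the first 28% of training and then decays to zero over the final 72% of steps. A minimal sketch assuming a linear decay shape (the exact decay curve and `base_lr` are illustrative assumptions):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    """Constant LR, then linear decay to zero over the last warmdown_frac of steps."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)
```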
Quantization
GPTQ
bits: 6
scope: all
Compression
Brotli
level: null

Novel Contributions

  • 3-layer depth recurrence over layers 3, 4, and 5
  • Earlier recurrence activation at step 2000
  • Higher weight decay and matrix learning rate tuning for better GPTQ quantization
  • EMA decay tuned to 0.9965
  • Extended warmdown fraction to 72%
  • Record low val_bpb of 1.0889 with all artifacts under 16 MB