PR #1445

Status: open

[Record] 3-Layer Depth Recurrence + EMA 0.9965 + WD 0.095 — val_bpb 1.0889

by X-Abhishek-X
val_bpb: 1.0889
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.89 MB

Training Techniques

Architecture
depth recurrence
Repeats layers 3, 4, and 5 as a 3-layer recurrence, creating 14 virtual layers from 11 physical layers.
parameters: {"layers":[3,4,5],"virtual_layers":14,"physical_layers":11,"start_step":2000}
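The recurrence above can be sketched as a plain forward loop; a minimal sketch, assuming the model's layers are a list of callables (function and argument names here are hypothetical). With 11 physical layers and the 3-layer block at indices 3–5 applied twice, the effective depth is 11 + 3 = 14 virtual layers, matching the parameters above.

```python
def forward_with_recurrence(x, layers, recur_indices=(3, 4, 5), repeats=2):
    """Apply `layers` in order, looping the recurrent block `repeats` times.

    `repeats=2` means the block runs twice in total, so 11 physical layers
    yield 14 virtual layer applications.
    """
    first, last = recur_indices[0], recur_indices[-1]
    for i, layer in enumerate(layers):
        if i == first:
            for _ in range(repeats):
                for sub in layers[first:last + 1]:
                    x = sub(x)
        elif i in recur_indices:
            continue  # already applied inside the recurrence loop
        else:
            x = layer(x)
    return x
```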
weight tying
Tied input and output embeddings.
parameters: null
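A minimal NumPy sketch of what weight tying means here: the output head reuses the input embedding matrix, so no separate unembedding weight is stored. The variable names and toy dimensions are illustrative.

```python
import numpy as np

vocab_size, d_model = 10, 4

# Shared embedding table: used both to embed input tokens and, transposed,
# as the output projection (the "tied" weights).
E = np.arange(vocab_size * d_model, dtype=float).reshape(vocab_size, d_model)

h = np.ones(d_model)   # final hidden state for one position
logits = h @ E.T       # tied output head: logits over the vocabulary
```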
GQA
Uses grouped query attention with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
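The KV-sharing step of grouped query attention with these settings can be sketched as follows: with 8 query heads and 4 KV heads, each KV head is duplicated so that two query heads attend over the same keys and values. The tensor layout and function name are illustrative assumptions.

```python
import numpy as np

def repeat_kv(kv, n_heads=8, n_kv_heads=4):
    """kv: (n_kv_heads, seq, head_dim) -> (n_heads, seq, head_dim).

    Each KV head is repeated group_size times so consecutive query heads
    share the same keys/values.
    """
    group_size = n_heads // n_kv_heads
    return np.repeat(kv, group_size, axis=0)
```

In practice this repetition is usually fused into the attention kernel rather than materialized, which is where the KV-cache memory savings come from.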
VE128
Shared Value Embedding used in layers 9 and 10.
parameters: {"dimensions":128,"layers":[9,10]}
U-Net skip connections
Skip gates and parallel residual connections from layer 7.
parameters: {"from_layer":7}
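A minimal sketch of a gated U-Net-style skip: the activation leaving layer 7 is cached and added back through a gate before a later layer. Only `from_layer=7` comes from the parameters above; the destination layer and gate value are illustrative assumptions (the gate would normally be a learned scalar).

```python
def forward_with_skip(x, layers, from_layer=7, to_layer=10, gate=0.5):
    cached = None
    for i, layer in enumerate(layers):
        if i == to_layer and cached is not None:
            x = x + gate * cached   # gated skip connection
        x = layer(x)
        if i == from_layer:
            cached = x              # activation flowing out of layer 7
    return x
```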
Weight Averaging
EMA
parameters: {"decay":0.9965}
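The EMA update with this decay can be sketched in a few lines; a minimal sketch where parameters are plain floats (in a real model they would be tensors, and the EMA copy is the one used for evaluation).

```python
def ema_update(ema_params, params, decay=0.9965):
    """In-place EMA of model weights: ema <- decay * ema + (1 - decay) * current."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```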
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"lr":0.022,"backend_steps":5}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.008,"fused":true,"role":"head"}
AdamW
weight_decay: 0.095
momentum: null
other_params: {"lr":0.6,"fused":true,"role":"embeddings"}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"lr":0.02,"fused":true,"role":"scalars"}
Regularization
weight decay
parameters: {"value":0.095}
logit softcap
parameters: {"value":30}
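Logit soft-capping at 30 squashes logits smoothly into (-30, 30) via tanh, leaving small logits almost unchanged while bounding outliers; a minimal sketch:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap); near-identity for |logit| << cap."""
    return cap * math.tanh(logit / cap)
```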
Evaluation
sliding window eval
parameters: {"stride":64}
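Sliding-window evaluation with stride 64 scores the document in overlapping windows: the first window scores all its tokens, and each subsequent window scores only its final `stride` tokens, so later tokens get near-maximal left context. A minimal sketch of the window bookkeeping; the function name is hypothetical, the window size is illustrative, and the sketch assumes `seq_len - window` is a multiple of `stride`.

```python
def sliding_window_positions(seq_len, window=2048, stride=64):
    """Return (start, end, first_scored) triples covering [0, seq_len)."""
    positions = []
    start = 0
    while True:
        end = min(start + window, seq_len)
        first_scored = start if start == 0 else end - stride
        positions.append((start, end, first_scored))
        if end >= seq_len:
            break
        start += stride
    return positions
```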
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
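With `warmdown_frac` of 0.72, the learning rate is held constant for the first 28% of training and then decays to zero over the final 72% of steps. A minimal sketch assuming a linear decay shape (the exact decay curve and `base_lr` are illustrative assumptions):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    """Constant LR, then linear decay to zero over the last warmdown_frac of steps."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)
```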
Quantization
GPTQ
bits: 6
scope: all
Compression
Brotli
level: null

Novel Contributions

  • 3-layer depth recurrence over layers 3, 4, and 5
  • Earlier recurrence activation at step 2000
  • Higher weight decay and matrix learning rate tuning for better GPTQ quantization
  • EMA decay tuned to 0.9965
  • Extended warmdown fraction to 72%
  • Record low val_bpb of 1.0889 with all artifacts under 16 MB