PR #1421
[Record] 11L Depth Recurrence + EMA Tuning (0.9965) — val_bpb 1.0925
open, by X-Abhishek-X
val_bpb: 1.0925
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.95 MB
Training Techniques
Architecture
depth recurrence
Layers 4 and 5 are repeated, turning the 11 physical layers into a virtual 13-layer model; recurrence is activated at step 3000 of training.
parameters: {"layers":[4,5],"virtual_layers":13,"activated_step":3000}
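A minimal sketch of how such a repeat schedule might work (function names and the exact repeat order are assumptions; the PR's implementation may differ):

```python
# Illustrative depth-recurrence schedule: repeating physical layers 4 and 5
# once each expands an 11-layer stack into 13 virtual forward steps.

def build_schedule(n_layers, repeated):
    """Virtual layer order: each repeated index is traversed twice."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if i in repeated:
            order.append(i)  # second pass through the same weights
    return order

def forward(x, layers, repeated=(4, 5), step=0, activated_step=3000):
    """Run the plain stack before the activation step, the recurrent one after."""
    if step >= activated_step:
        order = build_schedule(len(layers), repeated)
    else:
        order = list(range(len(layers)))
    for i in order:
        x = layers[i](x)
    return x
```

Because the repeated layers reuse their weights, the parameter count (and artifact size) stays that of an 11-layer model while the forward depth grows to 13.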
GQA
Grouped-query attention: 8 query heads share 4 KV heads (2 query heads per KV head).
parameters: {"heads":8,"kv_heads":4}
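A pure-Python, single-position sketch of the KV-head sharing implied by these parameters (shapes and names are illustrative, not the PR's code):

```python
import math

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention for one query position, over plain lists.

    q: per-query-head vectors, shape [n_heads][d]
    k, v: per-KV-head key/value caches, shape [n_kv_heads][T][d]
    Each group of n_heads // n_kv_heads query heads reads the same K/V,
    shrinking the KV cache by that factor.
    """
    group = n_heads // n_kv_heads
    d = len(q[0])
    out = []
    for h in range(n_heads):
        kv = h // group  # shared KV head for this query head
        scores = [sum(qi * ki for qi, ki in zip(q[h], kt)) / math.sqrt(d)
                  for kt in k[kv]]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]  # stable softmax numerator
        z = sum(w)
        out.append([sum(wi * vt[j] for wi, vt in zip(w, v[kv])) / z
                    for j in range(d)])
    return out
```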
weight tying
Tied input and output embeddings.
parameters: null
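Weight tying in miniature (the attribute names `wte` and `lm_head` are illustrative): the output head is a reference to the input embedding table, so the vocab-by-d_model matrix is stored, and trained, once.

```python
class TiedLM:
    def __init__(self, vocab_size, d_model):
        self.wte = [[0.0] * d_model for _ in range(vocab_size)]  # input embeddings
        self.lm_head = self.wte  # tied: same object, not a copy
```

Any update through either name is visible through the other, which is exactly what saves parameters in the artifact.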
VE128
A single 128-dimensional value-embedding table shared between layers 9 and 10.
parameters: {"dimensions":128,"layers":[9,10]}
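A hedged sketch of what "shared" could mean here: one token-indexed 128-dim table referenced by both layers, with its vectors mixed into each layer's attention value path. The additive mixing shown is an assumption about the mechanism, not the PR's code.

```python
class SharedValueEmbedding:
    """One 128-dim value-embedding table shared by multiple layers."""
    def __init__(self, vocab_size, dim=128):
        self.table = [[0.0] * dim for _ in range(vocab_size)]

    def mix(self, values, token_ids):
        """Add the per-token embedding into a layer's value vectors
        (additive mixing is an assumption for illustration)."""
        return [[vi + ei for vi, ei in zip(v, self.table[t])]
                for v, t in zip(values, token_ids)]

ve = SharedValueEmbedding(vocab_size=256)
layer9_ve = layer10_ve = ve  # both layers reference the same table, not copies
```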
Weight Averaging
EMA
Exponential moving average of the model weights (decay 0.9965), updated every step.
parameters: {"decay":0.9965,"every_step":true}
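The per-step update these parameters describe, as a minimal sketch over flat parameter lists (the real training loop would apply this to tensors):

```python
def ema_update(ema_params, params, decay=0.9965):
    """One EMA step, in place: ema <- decay * ema + (1 - decay) * param.
    With every_step=true this runs after every optimizer update."""
    for i, p in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * p
```

The EMA copy, not the raw weights, is what typically gets evaluated and exported; the contributions list below credits the decay choice (0.997 → 0.9965) with better post-quantization stability.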
Optimizer
Muon
weight_decay: 0.09
momentum: 0.99
other_params: {"lr":0.02,"backend_steps":5}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.008,"fused":true,"role":"head"}
AdamW
weight_decay: 0.09
momentum: null
other_params: {"lr":0.6,"fused":true,"role":"embeddings"}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"lr":0.02,"fused":true,"role":"scalars"}
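The four optimizer blocks above imply a parameter-group split by role. A hedged sketch of how that routing might look; the matching predicates ("lm_head", "embed", rank ≤ 1) and group names are assumptions:

```python
class FakeParam:  # stand-in for a tensor; only the rank matters here
    def __init__(self, ndim):
        self.ndim = ndim

def build_optimizer_groups(named_params):
    """Route each named parameter to one of the four optimizer configs above."""
    groups = {
        "muon":             {"lr": 0.02,  "weight_decay": 0.09, "momentum": 0.99, "params": []},
        "adam_head":        {"lr": 0.008, "fused": True, "params": []},
        "adamw_embeddings": {"lr": 0.6,   "weight_decay": 0.09, "fused": True, "params": []},
        "adamw_scalars":    {"lr": 0.02,  "weight_decay": 0.02, "fused": True, "params": []},
    }
    for name, p in named_params:
        if "lm_head" in name:
            groups["adam_head"]["params"].append(name)
        elif "embed" in name:
            groups["adamw_embeddings"]["params"].append(name)
        elif p.ndim <= 1:   # biases, gains, other scalar-ish tensors
            groups["adamw_scalars"]["params"].append(name)
        else:               # 2-D weight matrices go to Muon
            groups["muon"]["params"].append(name)
    return groups
```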
Evaluation
sliding window eval
parameters: {"stride":64}
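One common reading of a stride-64 sliding-window eval: each window rescans the full context, but only its last 64 tokens contribute fresh loss terms, so every token is scored exactly once with near-maximal context. A sketch under that assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans for sliding-window evaluation.

    Each window covers [start, end); only tokens in [score_from, end)
    are scored (the first window scores everything)."""
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start = end - window + stride
    return spans
```

With window=2048 and stride=64, each step recomputes 2048 tokens to score 64, trading compute for a lower (more accurate) bpb than chunked evaluation.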
Quantization
GPTQ
bits: 6
scope: all
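GPTQ proper performs a Hessian-aware, column-by-column rounding with error compensation; the sketch below shows only the symmetric int6 storage grid such weights land on, not the GPTQ procedure itself.

```python
def quantize_int6(values):
    """Illustrative symmetric 6-bit uniform quantization (63 levels, -31..31).
    Round-to-nearest only; real GPTQ chooses rounded values with an
    error-compensation pass that this sketch omits."""
    amax = max(abs(v) for v in values)
    scale = amax / 31 if amax > 0 else 1.0
    q = [max(-31, min(31, round(v / scale))) for v in values]
    dq = [qi * scale for qi in q]  # dequantized weights used at inference
    return q, scale, dq
```

Storing 6-bit codes plus per-group scales (instead of 16- or 32-bit floats) is what brings the artifact down toward the 15.95 MB figure above.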
Regularization
logit softcap
parameters: {"value":30}
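With value 30, the softcap is the usual tanh squash applied to logits:

```python
import math

def softcap(logit, cap=30.0):
    """Logit soft-capping: cap * tanh(logit / cap).

    Near-identity for |logit| << cap, smoothly bounded to [-cap, cap],
    which keeps extreme logits from destabilizing the loss."""
    return cap * math.tanh(logit / cap)
```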
magnitude pruning
parameters: {"type":"selective pruning","values_pruned":290000}
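Magnitude pruning zeros the smallest-magnitude weights; here 290,000 values are pruned (which tensors are "selectively" targeted is not specified, so the global variant below is an assumption):

```python
def prune_smallest(values, n_prune):
    """Zero out the n_prune smallest-magnitude entries (magnitude pruning)."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    pruned = list(values)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

Pruned (zero) weights compress very well, which is presumably how the int6 artifact is squeezed under the 16 MB limit noted in the contributions below.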
Compression
brotli
level: null
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.667}
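A warmdown schedule holds the learning rate constant, then decays it linearly to zero over the final fraction of training. A sketch under the assumptions that the decay is linear and the base LR is Muon's 0.02 (neither is stated above):

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_fraction=0.667):
    """Constant LR, then a linear ramp to zero over the last
    warmdown_fraction of training (no warmup shown)."""
    warmdown_start = total_steps * (1.0 - warmdown_fraction)
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```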
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- EMA decay tuning from 0.997 to 0.9965 to improve post-quantization stability
- Depth recurrence architecture with repeated layers 4 and 5
- Selective pruning to fit GPTQ int6 artifacts under the 16MB limit
- Record-setting 3-seed mean val_bpb of 1.0925