PR #1421
[Record] 11L Depth Recurrence + EMA Tuning (0.9965) — val_bpb 1.0925
open, by X-Abhishek-X
val_bpb: 1.0925
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.95 MB
Training Techniques
Architecture
depth recurrence
Layers 4 and 5 are repeated, turning the 11 physical layers into a virtual 13-layer model; recurrence is activated at step 3000 of training.
parameters: {"layers":[4,5],"virtual_layers":13,"activated_step":3000}
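A minimal sketch of how such a repeat schedule might work (function names and the exact repeat order are assumptions; the PR's implementation may differ):

```python
# Illustrative depth-recurrence schedule: repeating physical layers 4 and 5
# once each expands an 11-layer stack into 13 virtual forward steps.

def build_schedule(n_layers, repeated):
    """Virtual layer order: each repeated index is traversed twice."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if i in repeated:
            order.append(i)  # second pass through the same weights
    return order

def forward(x, layers, repeated=(4, 5), step=0, activated_step=3000):
    """Run the plain stack before the activation step, the recurrent one after."""
    if step >= activated_step:
        order = build_schedule(len(layers), repeated)
    else:
        order = list(range(len(layers)))
    for i in order:
        x = layers[i](x)
    return x
```

Because the repeated layers reuse their weights, the parameter count (and artifact size) stays that of an 11-layer model while the forward depth grows to 13.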
GQA
Grouped-query attention: 8 query heads share 4 KV heads (2 query heads per KV head).
parameters: {"heads":8,"kv_heads":4}
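A pure-Python, single-position sketch of the KV-head sharing implied by these parameters (shapes and names are illustrative, not the PR's code):

```python
import math

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention for one query position, over plain lists.

    q: per-query-head vectors, shape [n_heads][d]
    k, v: per-KV-head key/value caches, shape [n_kv_heads][T][d]
    Each group of n_heads // n_kv_heads query heads reads the same K/V,
    shrinking the KV cache by that factor.
    """
    group = n_heads // n_kv_heads
    d = len(q[0])
    out = []
    for h in range(n_heads):
        kv = h // group  # shared KV head for this query head
        scores = [sum(qi * ki for qi, ki in zip(q[h], kt)) / math.sqrt(d)
                  for kt in k[kv]]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]  # stable softmax numerator
        z = sum(w)
        out.append([sum(wi * vt[j] for wi, vt in zip(w, v[kv])) / z
                    for j in range(d)])
    return out
```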
weight tying
Tied input and output embeddings.
parameters: null
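Weight tying in miniature (the attribute names `wte` and `lm_head` are illustrative): the output head is a reference to the input embedding table, so the vocab-by-d_model matrix is stored, and trained, once.

```python
class TiedLM:
    def __init__(self, vocab_size, d_model):
        self.wte = [[0.0] * d_model for _ in range(vocab_size)]  # input embeddings
        self.lm_head = self.wte  # tied: same object, not a copy
```

Any update through either name is visible through the other, which is exactly what saves parameters in the artifact.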
VE128
A single 128-dimensional value-embedding table shared between layers 9 and 10.
parameters: {"dimensions":128,"layers":[9,10]}
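A hedged sketch of what "shared" could mean here: one token-indexed 128-dim table referenced by both layers, with its vectors mixed into each layer's attention value path. The additive mixing shown is an assumption about the mechanism, not the PR's code.

```python
class SharedValueEmbedding:
    """One 128-dim value-embedding table shared by multiple layers."""
    def __init__(self, vocab_size, dim=128):
        self.table = [[0.0] * dim for _ in range(vocab_size)]

    def mix(self, values, token_ids):
        """Add the per-token embedding into a layer's value vectors
        (additive mixing is an assumption for illustration)."""
        return [[vi + ei for vi, ei in zip(v, self.table[t])]
                for v, t in zip(values, token_ids)]

ve = SharedValueEmbedding(vocab_size=256)
layer9_ve = layer10_ve = ve  # both layers reference the same table, not copies
```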
Weight Averaging
EMA
Exponential moving average of the model weights (decay 0.9965), updated every step.
parameters: {"decay":0.9965,"every_step":true}
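The per-step update these parameters describe, as a minimal sketch over flat parameter lists (the real training loop would apply this to tensors):

```python
def ema_update(ema_params, params, decay=0.9965):
    """One EMA step, in place: ema <- decay * ema + (1 - decay) * param.
    With every_step=true this runs after every optimizer update."""
    for i, p in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * p
```

The EMA copy, not the raw weights, is what typically gets evaluated and exported; the contributions list below credits the decay choice (0.997 → 0.9965) with better post-quantization stability.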
Optimizer
Muon
weight_decay: 0.09
momentum: 0.99
other_params: {"lr":0.02,"backend_steps":5}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.008,"fused":true,"role":"head"}
AdamW
weight_decay: 0.09
momentum: null
other_params: {"lr":0.6,"fused":true,"role":"embeddings"}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"lr":0.02,"fused":true,"role":"scalars"}
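The four optimizer blocks above imply a parameter-group split by role. A hedged sketch of how that routing might look; the matching predicates ("lm_head", "embed", rank ≤ 1) and group names are assumptions:

```python
class FakeParam:  # stand-in for a tensor; only the rank matters here
    def __init__(self, ndim):
        self.ndim = ndim

def build_optimizer_groups(named_params):
    """Route each named parameter to one of the four optimizer configs above."""
    groups = {
        "muon":             {"lr": 0.02,  "weight_decay": 0.09, "momentum": 0.99, "params": []},
        "adam_head":        {"lr": 0.008, "fused": True, "params": []},
        "adamw_embeddings": {"lr": 0.6,   "weight_decay": 0.09, "fused": True, "params": []},
        "adamw_scalars":    {"lr": 0.02,  "weight_decay": 0.02, "fused": True, "params": []},
    }
    for name, p in named_params:
        if "lm_head" in name:
            groups["adam_head"]["params"].append(name)
        elif "embed" in name:
            groups["adamw_embeddings"]["params"].append(name)
        elif p.ndim <= 1:   # biases, gains, other scalar-ish tensors
            groups["adamw_scalars"]["params"].append(name)
        else:               # 2-D weight matrices go to Muon
            groups["muon"]["params"].append(name)
    return groups
```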
Evaluation
sliding window eval
parameters: {"stride":64}
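One common reading of a stride-64 sliding-window eval: each window rescans the full context, but only its last 64 tokens contribute fresh loss terms, so every token is scored exactly once with near-maximal context. A sketch under that assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans for sliding-window evaluation.

    Each window covers [start, end); only tokens in [score_from, end)
    are scored (the first window scores everything)."""
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = 0 if start == 0 else end - stride
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start = end - window + stride
    return spans
```

With window=2048 and stride=64, each step recomputes 2048 tokens to score 64, trading compute for a lower (more accurate) bpb than chunked evaluation.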
Quantization
GPTQ
bits: 6
scope: all
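GPTQ proper performs a Hessian-aware, column-by-column rounding with error compensation; the sketch below shows only the symmetric int6 storage grid such weights land on, not the GPTQ procedure itself.

```python
def quantize_int6(values):
    """Illustrative symmetric 6-bit uniform quantization (63 levels, -31..31).
    Round-to-nearest only; real GPTQ chooses rounded values with an
    error-compensation pass that this sketch omits."""
    amax = max(abs(v) for v in values)
    scale = amax / 31 if amax > 0 else 1.0
    q = [max(-31, min(31, round(v / scale))) for v in values]
    dq = [qi * scale for qi in q]  # dequantized weights used at inference
    return q, scale, dq
```

Storing 6-bit codes plus per-group scales (instead of 16- or 32-bit floats) is what brings the artifact down toward the 15.95 MB figure above.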
Regularization
logit softcap
parameters: {"value":30}
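With value 30, the softcap is the usual tanh squash applied to logits:

```python
import math

def softcap(logit, cap=30.0):
    """Logit soft-capping: cap * tanh(logit / cap).

    Near-identity for |logit| << cap, smoothly bounded to [-cap, cap],
    which keeps extreme logits from destabilizing the loss."""
    return cap * math.tanh(logit / cap)
```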
magnitude pruning
parameters: {"type":"selective pruning","values_pruned":290000}
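Magnitude pruning zeros the smallest-magnitude weights; here 290,000 values are pruned (which tensors are "selectively" targeted is not specified, so the global variant below is an assumption):

```python
def prune_smallest(values, n_prune):
    """Zero out the n_prune smallest-magnitude entries (magnitude pruning)."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    pruned = list(values)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned
```

Pruned (zero) weights compress very well, which is presumably how the int6 artifact is squeezed under the 16 MB limit noted in the contributions below.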
Compression
brotli
level: null
LR Schedule
warmdown
parameters: {"warmdown_fraction":0.667}
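A warmdown schedule holds the learning rate constant, then decays it linearly to zero over the final fraction of training. A sketch under the assumptions that the decay is linear and the base LR is Muon's 0.02 (neither is stated above):

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_fraction=0.667):
    """Constant LR, then a linear ramp to zero over the last
    warmdown_fraction of training (no warmup shown)."""
    warmdown_start = total_steps * (1.0 - warmdown_fraction)
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```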
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- EMA decay tuning from 0.997 to 0.9965 to improve post-quantization stability
- Depth recurrence architecture with repeated layers 4 and 5
- Selective pruning to fit GPTQ int6 artifacts under the 16MB limit
- Record-setting 3-seed mean val_bpb of 1.0925