PR #1395
openRecord: SP4096 + Linear LR + Depth Recurrence -- val_bpb=1.0924 (3-seed mean)
by dttdrv
val_bpb
1.0924
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.99 MB
Training Techniques
Architecture
SP4096
SentencePiece BPE vocabulary of size 4096 with an 11-layer, 512-dim transformer backbone.
parameters: {"layers":11,"dimensions":512,"vocab_size":4096}
LeakyReLU
MLP uses the LeakyReLU(0.5)^2 activation (LeakyReLU with negative slope 0.5, then squared).
parameters: {"slope":0.5}
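A minimal sketch of the activation as literally written on the card. Note that a plain square maps negative inputs to positive outputs; whether the record's kernel is sign-preserving is not stated here, so this is only the literal reading of LeakyReLU(0.5)^2.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # LeakyReLU with negative slope 0.5, followed by an elementwise square.
    y = x if x >= 0.0 else slope * x
    return y * y
```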
depth recurrence
Layers 4 and 5 are repeated starting from step 3000.
parameters: {"layers":[4,5],"start_step":3000}
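The depth-recurrence entry can be sketched as a forward pass that, once training reaches step 3000, runs blocks 4 and 5 twice each with shared weights. The weight-tied repeat is an assumption; the card only says the layers "are repeated".

```python
def forward_with_recurrence(blocks, x, step, recur=(4, 5), start_step=3000):
    # Standard forward pass before start_step; afterwards, the blocks listed
    # in `recur` each run twice, deepening the network mid-training without
    # adding parameters.
    for i, block in enumerate(blocks):
        x = block(x)
        if step >= start_step and i in recur:
            x = block(x)  # second pass through the same (shared) weights
    return x
```

With 11 increment blocks, the effective depth goes from 11 to 13 once the recurrence switches on.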
U-Net skip connections
Gated encoder-decoder style skip connections are used.
parameters: null
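One plausible reading of the gated encoder-decoder skips, sketched with plain scalar gates: the first half of the stack stashes activations and the second half adds them back, gated, in last-in-first-out order. The pairing scheme and the form of the gates (learned scalars here replaced by fixed numbers) are assumptions, since the card gives no parameters.

```python
def unet_forward(blocks, x, gates):
    # First half of the stack acts as "encoder": activations are stashed.
    # Second half acts as "decoder": each block first adds a gated skip from
    # the matching encoder depth (LIFO pairing). `gates` holds one scalar per
    # block; in practice these would be learned parameters.
    half = len(blocks) // 2
    skips = []
    for i, block in enumerate(blocks):
        if i >= half and skips:
            x = x + gates[i] * skips.pop()
        x = block(x)
        if i < half:
            skips.append(x)
    return x
```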
XSA
Exclusive Self Attention applied to all 11 layers.
parameters: {"layers":11}
QK-Gain
Attention QK gain set to 5.0.
parameters: {"value":5}
RoPE
Partial rotary positional embeddings (16 of the 64 head dimensions rotated).
parameters: {"dimensions":"16/64"}
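A sketch of partial RoPE on a single 64-dim head vector: only the first 16 dimensions are rotated in 2-D pairs, the rest pass through. The frequency convention (base 10000, exponent over the rotated span) is an assumption; the card only gives the 16/64 split.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    # Rotate the first `rot_dims` entries in 2-D pairs by position-dependent
    # angles; leave the remaining dimensions untouched ("partial" RoPE).
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

Rotations preserve the norm of each pair, so position encoding costs nothing in vector magnitude.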
SmearGate
Learned token blending mechanism.
parameters: null
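The card defines SmearGate only as "learned token blending", so the following is a speculative sketch of one plausible form: each token embedding is blended with its predecessor through a gate. The gate is a fixed scalar here; a real SmearGate would presumably learn it (and likely condition it on the input).

```python
def smear(tokens, gate=0.2):
    # Blend each token with its predecessor:
    #   y[t] = (1 - g) * x[t] + g * x[t - 1]
    # `gate` is a fixed scalar stand-in for a learned parameter.
    out = [tokens[0]]
    for t in range(1, len(tokens)):
        out.append((1 - gate) * tokens[t] + gate * tokens[t - 1])
    return out
```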
Optimizer
Muon
weight_decay: 0.09
momentum: 0.99
other_params: {"matrix_lr":0.02,"adamw_scalars_embeddings":true,"adam_weight_decay":0.02,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
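The Muon settings include a momentum warmup from 0.92 to 0.99 over 1500 steps. A sketch assuming linear interpolation (the schedule shape is not stated on the card):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linearly ramp momentum from `start` to `end` over the first
    # `warmup_steps` optimizer steps, then hold it at `end`.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```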
Weight Averaging
EMA
parameters: {"decay":0.997,"every_step":true}
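The EMA entry (decay 0.997, applied every step) corresponds to the standard update below, maintained alongside the live weights and used at evaluation time:

```python
def ema_update(avg, params, decay=0.997):
    # One EMA step, applied after every optimizer update (every_step=true):
    #   avg <- decay * avg + (1 - decay) * params
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]
```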
Quantization
GPTQ
bits: 6
scope: all attention + MLP weight matrices
int8
bits: 8
scope: embeddings
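A sketch of int8 quantization as it might apply to the embeddings. Symmetric per-tensor scaling is an assumption; the card does not say whether the scheme is symmetric or per-row.

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8: scale so the largest magnitude maps to 127.
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

Round-trip error is bounded by half the scale per weight, which is what the "quantization gap" in the contributions list measures against the float model.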
Compression
Brotli
level: 10
Evaluation
sliding window eval
parameters: {"stride":64}
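With stride 64, sliding-window evaluation advances the context window 64 tokens at a time and scores only the new tokens, each with (up to) a full window of left context. The exact span layout below is an assumption consistent with that stride:

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    # Each span is (ctx_start, score_start, end): tokens [score_start, end)
    # are scored, conditioned on context reaching back to ctx_start, so every
    # token is scored exactly once.
    spans = []
    for score_start in range(0, n_tokens, stride):
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, score_start, end))
    return spans
```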
LR Schedule
warmdown
parameters: {"fraction":0.667,"final_lr":0,"type":"linear"}
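Reading fraction 0.667 as the final two-thirds of training, the schedule holds the base LR and then decays linearly to zero, with no cosine shape and no non-zero floor (per the contributions list). Holding constant before the warmdown is an assumption:

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.667, final_lr=0.0):
    # Hold base_lr, then decay linearly to final_lr (here zero) over the
    # final `warmdown_frac` of training.
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step <= warmdown_start:
        return base_lr
    t = (step - warmdown_start) / (total_steps - warmdown_start)
    return base_lr + t * (final_lr - base_lr)
```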
Regularization
weight decay
parameters: {"muon_wd":0.09,"adam_wd":0.02}
magnitude pruning
parameters: {"factor":4}
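Reading "factor": 4 together with the "4x excess" wording in the contributions list, magnitude pruning here plausibly keeps the top 1/4 of weights by absolute value and zeros the rest. That interpretation is an assumption; a sketch:

```python
def magnitude_prune(weights, factor=4):
    # Keep the largest 1/factor of weights by absolute value; zero the rest.
    k = max(1, len(weights) // factor)
    thresh = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= thresh else 0.0 for w in weights]
```

A sparser tensor after pruning is also what lets the Brotli stage compress the artifact further.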
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Linear warmdown to zero instead of cosine decay with a non-zero floor
- Reduced selective pruning factor from 8x excess to 4x excess
- Narrowed the quantization gap and improved compression enough to set a new record val_bpb
- Depth recurrence combined with SP4096 architecture and MuonEq-R training