PR #1395
openRecord: SP4096 + Linear LR + Depth Recurrence -- val_bpb=1.0924 (3-seed mean)
by dttdrv
val_bpb
1.0924
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.99 MB
Training Techniques
Architecture
SP4096
SentencePiece BPE vocabulary of size 4096 with an 11-layer, 512-dim transformer backbone.
parameters: {"layers":11,"dimensions":512,"vocab_size":4096}
LeakyReLU
MLP uses the LeakyReLU(0.5)^2 activation (LeakyReLU with negative slope 0.5, then squared).
parameters: {"slope":0.5}
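A minimal sketch of the activation as literally written on the card. Note that a plain square maps negative inputs to positive outputs; whether the record's kernel is sign-preserving is not stated here, so this is only the literal reading of LeakyReLU(0.5)^2.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # LeakyReLU with negative slope 0.5, followed by an elementwise square.
    y = x if x >= 0.0 else slope * x
    return y * y
```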
depth recurrence
Layers 4 and 5 are repeated starting from step 3000.
parameters: {"layers":[4,5],"start_step":3000}
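The depth-recurrence entry can be sketched as a forward pass that, once training reaches step 3000, runs blocks 4 and 5 twice each with shared weights. The weight-tied repeat is an assumption; the card only says the layers "are repeated".

```python
def forward_with_recurrence(blocks, x, step, recur=(4, 5), start_step=3000):
    # Standard forward pass before start_step; afterwards, the blocks listed
    # in `recur` each run twice, deepening the network mid-training without
    # adding parameters.
    for i, block in enumerate(blocks):
        x = block(x)
        if step >= start_step and i in recur:
            x = block(x)  # second pass through the same (shared) weights
    return x
```

With 11 increment blocks, the effective depth goes from 11 to 13 once the recurrence switches on.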
U-Net skip connections
Gated encoder-decoder style skip connections are used.
parameters: null
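One plausible reading of the gated encoder-decoder skips, sketched with plain scalar gates: the first half of the stack stashes activations and the second half adds them back, gated, in last-in-first-out order. The pairing scheme and the form of the gates (learned scalars here replaced by fixed numbers) are assumptions, since the card gives no parameters.

```python
def unet_forward(blocks, x, gates):
    # First half of the stack acts as "encoder": activations are stashed.
    # Second half acts as "decoder": each block first adds a gated skip from
    # the matching encoder depth (LIFO pairing). `gates` holds one scalar per
    # block; in practice these would be learned parameters.
    half = len(blocks) // 2
    skips = []
    for i, block in enumerate(blocks):
        if i >= half and skips:
            x = x + gates[i] * skips.pop()
        x = block(x)
        if i < half:
            skips.append(x)
    return x
```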
XSA
Exclusive Self Attention applied to all 11 layers.
parameters: {"layers":11}
QK-Gain
Attention QK gain set to 5.0.
parameters: {"value":5}
RoPE
Partial rotary positional embeddings (16 of the 64 head dimensions rotated).
parameters: {"dimensions":"16/64"}
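A sketch of partial RoPE on a single 64-dim head vector: only the first 16 dimensions are rotated in 2-D pairs, the rest pass through. The frequency convention (base 10000, exponent over the rotated span) is an assumption; the card only gives the 16/64 split.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    # Rotate the first `rot_dims` entries in 2-D pairs by position-dependent
    # angles; leave the remaining dimensions untouched ("partial" RoPE).
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

Rotations preserve the norm of each pair, so position encoding costs nothing in vector magnitude.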
SmearGate
Learned token blending mechanism.
parameters: null
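The card defines SmearGate only as "learned token blending", so the following is a speculative sketch of one plausible form: each token embedding is blended with its predecessor through a gate. The gate is a fixed scalar here; a real SmearGate would presumably learn it (and likely condition it on the input).

```python
def smear(tokens, gate=0.2):
    # Blend each token with its predecessor:
    #   y[t] = (1 - g) * x[t] + g * x[t - 1]
    # `gate` is a fixed scalar stand-in for a learned parameter.
    out = [tokens[0]]
    for t in range(1, len(tokens)):
        out.append((1 - gate) * tokens[t] + gate * tokens[t - 1])
    return out
```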
Optimizer
Muon
weight_decay: 0.09
momentum: 0.99
other_params: {"matrix_lr":0.02,"adamw_scalars_embeddings":true,"adam_weight_decay":0.02,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
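The Muon settings include a momentum warmup from 0.92 to 0.99 over 1500 steps. A sketch assuming linear interpolation (the schedule shape is not stated on the card):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linearly ramp momentum from `start` to `end` over the first
    # `warmup_steps` optimizer steps, then hold it at `end`.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```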
Weight Averaging
EMA
parameters: {"decay":0.997,"every_step":true}
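The EMA entry (decay 0.997, applied every step) corresponds to the standard update below, maintained alongside the live weights and used at evaluation time:

```python
def ema_update(avg, params, decay=0.997):
    # One EMA step, applied after every optimizer update (every_step=true):
    #   avg <- decay * avg + (1 - decay) * params
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]
```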
Quantization
GPTQ
bits: 6
scope: all attention + MLP weight matrices
int8
bits: 8
scope: embeddings
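A sketch of int8 quantization as it might apply to the embeddings. Symmetric per-tensor scaling is an assumption; the card does not say whether the scheme is symmetric or per-row.

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8: scale so the largest magnitude maps to 127.
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

Round-trip error is bounded by half the scale per weight, which is what the "quantization gap" in the contributions list measures against the float model.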
Compression
Brotli
level: 10
Evaluation
sliding window eval
parameters: {"stride":64}
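With stride 64, sliding-window evaluation advances the context window 64 tokens at a time and scores only the new tokens, each with (up to) a full window of left context. The exact span layout below is an assumption consistent with that stride:

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    # Each span is (ctx_start, score_start, end): tokens [score_start, end)
    # are scored, conditioned on context reaching back to ctx_start, so every
    # token is scored exactly once.
    spans = []
    for score_start in range(0, n_tokens, stride):
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, score_start, end))
    return spans
```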
LR Schedule
warmdown
parameters: {"fraction":0.667,"final_lr":0,"type":"linear"}
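Reading fraction 0.667 as the final two-thirds of training, the schedule holds the base LR and then decays linearly to zero, with no cosine shape and no non-zero floor (per the contributions list). Holding constant before the warmdown is an assumption:

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.667, final_lr=0.0):
    # Hold base_lr, then decay linearly to final_lr (here zero) over the
    # final `warmdown_frac` of training.
    warmdown_start = total_steps * (1.0 - warmdown_frac)
    if step <= warmdown_start:
        return base_lr
    t = (step - warmdown_start) / (total_steps - warmdown_start)
    return base_lr + t * (final_lr - base_lr)
```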
Regularization
weight decay
parameters: {"muon_wd":0.09,"adam_wd":0.02}
magnitude pruning
parameters: {"factor":4}
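Reading "factor": 4 together with the "4x excess" wording in the contributions list, magnitude pruning here plausibly keeps the top 1/4 of weights by absolute value and zeros the rest. That interpretation is an assumption; a sketch:

```python
def magnitude_prune(weights, factor=4):
    # Keep the largest 1/factor of weights by absolute value; zero the rest.
    k = max(1, len(weights) // factor)
    thresh = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= thresh else 0.0 for w in weights]
```

A sparser tensor after pruning is also what lets the Brotli stage compress the artifact further.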
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Linear warmdown to zero instead of cosine decay with a non-zero floor
- Reduced selective pruning factor from 8x excess to 4x excess
- Narrowed the quantization gap and improved compression enough to set a new record val_bpb
- Depth recurrence combined with SP4096 architecture and MuonEq-R training