PR #251

open

Add SP4096 11L432 MLP3x Int6+Zstd Momentum99 record (val_bpb=1.1596)

by kshitizz36
val_bpb: 1.1596
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3MB

Training Techniques

Architecture
MLP3x
Increased MLP expansion from 2x to 3x to add model capacity.
parameters: {"mlp_mult":3}
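A rough sketch of what the 2x to 3x change means in weight count, using the dim 432 width listed for this record and the relu^2 activation named under Novel Contributions. It assumes a plain two-matrix MLP (up-projection, activation, down-projection) with no biases or gating; the PR summary does not confirm that layout, and the helper names are illustrative.

```python
def relu_squared(x: float) -> float:
    """relu^2 activation: clamp negatives to zero, then square."""
    return max(x, 0.0) ** 2


def mlp_weight_count(dim: int, mlp_mult: int) -> int:
    """Weights in an assumed two-matrix MLP (biases ignored)."""
    hidden = dim * mlp_mult
    return dim * hidden + hidden * dim


dim = 432  # model width listed for this record
for mlp_mult in (2, 3):
    hidden = dim * mlp_mult
    print(f"mlp_mult={mlp_mult} hidden={hidden} weights={mlp_weight_count(dim, mlp_mult)}")
# mlp_mult=2 gives hidden=864 and 746496 weights per layer
# mlp_mult=3 gives hidden=1296 and 1119744 weights per layer
```

Under these assumptions the wider MLP adds 373,248 weights per layer.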
tied embeddings
Uses tied embeddings with fp16 embedding passthrough during quantization.
parameters: null
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
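With 8 query heads and 4 KV heads, each KV head is shared by two query heads. The sketch below shows one common way to map them (contiguous groups) and the resulting projection widths at dim 432. The grouping order and the `head_dim = dim // heads` convention are assumptions, not details taken from the PR.

```python
def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Index of the KV head that a given query head reads from."""
    group_size = heads // kv_heads  # query heads per KV head
    return query_head // group_size


dim, heads, kv_heads = 432, 8, 4
head_dim = dim // heads        # 54
kv_dim = kv_heads * head_dim   # 216: width of the K and V projections
mapping = [kv_head_for(h, heads, kv_heads) for h in range(heads)]

print(head_dim, kv_dim, mapping)
# mapping is [0, 0, 1, 1, 2, 2, 3, 3]
```

Under this layout the K and V projections are half the width of the Q projection, which saves weights relative to full multi-head attention.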
Optimizer
Muon
weight_decay: 0.02
momentum: 0.99
other_params: null
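For readers unfamiliar with Muon, here is a minimal pure-Python sketch of a single update using the momentum (0.99) and weight decay (0.02) listed above. The Newton-Schulz coefficients follow the publicly available reference implementation of Muon. The learning rate, the decoupled form of weight decay, and the omission of Nesterov momentum and shape-dependent scaling are assumptions for illustration, not the PR's training code.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]  # Frobenius-normalize first
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        xxt2 = matmul(xxt, xxt)
        poly = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(xxt, xxt2)]
        px = matmul(poly, x)
        x = [[a * v + w for v, w in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return x


def muon_step(weight, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.02):
    """One sketch update; lr is a hypothetical value, not from the PR."""
    # Accumulate the gradient into the momentum buffer.
    buf = [[momentum * m + g for m, g in zip(rm, rg)] for rm, rg in zip(buf, grad)]
    # Replace the raw buffer with its approximately orthogonalized form.
    update = newton_schulz(buf)
    # Decoupled weight decay, then the orthogonalized step.
    weight = [
        [w * (1 - lr * weight_decay) - lr * u for w, u in zip(rw, ru)]
        for rw, ru in zip(weight, update)
    ]
    return weight, buf


weight = [[1.0, 0.0], [0.0, 1.0]]
grad = [[3.0, 1.0], [1.0, 2.0]]
buf = [[0.0, 0.0], [0.0, 0.0]]
new_weight, new_buf = muon_step(weight, grad, buf)
```

The iteration does not converge to exactly orthogonal output; it pushes the singular values into a band around 1, which is what the optimizer relies on.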
Regularization
weight decay
parameters: {"muon_wd":0.02,"adam_wd":0.02}
Quantization
int6
bits: 6
scope: all except fp16 embeddings
fp16
bits: 16
scope: embeddings
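The summary states only that non-embedding weights are stored as int6. As one plausible reading, the sketch below does symmetric round-to-nearest quantization with one scale per row, mapping values to integers in [-31, 31]. The per-row granularity, the symmetric range, and the function names are assumptions.

```python
def quantize_int6(row):
    """Symmetric per-row quantization to integers in [-31, 31] (assumed scheme)."""
    qmax = 31  # 2**(6 - 1) - 1
    scale = max(abs(v) for v in row) / qmax or 1.0  # fall back to 1.0 for all-zero rows
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int6(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_error)
```

With round-to-nearest, the per-weight reconstruction error is bounded by half the scale, so rows with a smaller dynamic range are reproduced more precisely.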
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
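A sketch of how a stride-64 sliding window could tile the validation stream with the 1024-token eval length listed below. It assumes the first window scores all of its tokens and every later window scores only its newest 64, so each token is counted once with up to 960 tokens of context. The PR summary does not spell out this scoring rule, and `sliding_windows` is an illustrative name.

```python
def sliding_windows(num_tokens, seq_len=1024, stride=64):
    """Yield (start, end, first_scored) so that every token is scored exactly once."""
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + seq_len, num_tokens)
        # Tokens in [first_scored, end) contribute to the loss; earlier ones are context.
        yield start, end, scored_upto
        scored_upto = end
        start += stride


windows = list(sliding_windows(2048))
total_scored = sum(end - first for _, end, first in windows)
print(len(windows), windows[0], windows[1], total_scored)
# 17 windows; the first is (0, 1024, 0), the second is (64, 1088, 1024)
```

The trade-off is compute: a smaller stride gives each scored token more context but requires more forward passes over the same data.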
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
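One way to read `warmdown_steps: 3000` is a learning-rate multiplier that stays at 1.0 and then decays linearly to zero over the final 3000 steps. The linear shape, the step-based (rather than wall-clock-based) trigger, and the `total_steps` value used here are assumptions; the PR summary gives only the warmdown length.

```python
def lr_scale(step, total_steps, warmdown_steps=3000):
    """Multiplier on the base learning rate under an assumed linear warmdown."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


total_steps = 10_000  # hypothetical run length, not from the PR
for step in (0, 7_000, 8_500, 10_000):
    print(step, lr_scale(step, total_steps))
```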
Initialization
spectral init
Tied embeddings use overtone spectral initialization.

Novel Contributions

  • New SOTA validation score of 1.1596 bpb
  • 11-layer SP-4096 Transformer with dim 432
  • 3x MLP expansion with relu^2 activation
  • Muon optimizer momentum increased to 0.99
  • Int6 post-training quantization with zstd-22 compression
  • fp16 embedding passthrough to preserve embedding quality
  • Sliding-window evaluation with stride 64
  • Tied embeddings with overtone spectral initialization