PR #1508

open

Record: SP4096 + Compressibility Regularization — val_bpb 1.11349 (6-seed mean)

by jpfeiffe
val_bpb: 1.1135
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.68 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
Architecture
weight tying
Tied embeddings with the SP4096 tokenizer
parameters: {"vocab_size":4096}
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
XSA
XSA attention used on all layers
parameters: {"layers":11}
BigramHash
BigramHash component in the architecture
parameters: {"dimensions":112,"size":3072}
MLP3x
3x MLP with LeakyReLU squared activation
parameters: {"mlp_multiplier":3,"activation":"LeakyReLU"}
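As a rough illustration of the MLP3x entry, the sketch below applies a leaky ReLU and then squares the result, with the hidden width set to three times the model width. The negative slope of 0.5 and the function names are assumptions made for illustration; this summary does not state them.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Leaky ReLU followed by squaring.

    negative_slope is an assumed value, not taken from the PR summary.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y


def mlp_hidden_width(model_dim: int, mlp_multiplier: int = 3) -> int:
    """Hidden width of the MLP block for a given multiplier (3 in this PR)."""
    return model_dim * mlp_multiplier
```

Note that squaring makes the activation non-negative on both sides of zero, so the negative branch contributes a smaller but still positive value.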
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
LR Schedule
warmdown
parameters: {"iters":4000}
Regularization
weight decay
parameters: {"warmdown_multiplier":2}
Compression
brotli
level: 11
lzma
level: 9
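The compression entries describe trying two codecs and keeping the smaller output. A minimal sketch of that selection follows, assuming the serialized artifact is already available as bytes. The function names are hypothetical, and brotli is treated as an optional third-party package so the sketch still runs with only the standard library.

```python
import lzma
from typing import Callable, Dict, Tuple


def build_compressors() -> Dict[str, Callable[[bytes], bytes]]:
    """Return the available compressors; brotli is added only if installed."""
    compressors: Dict[str, Callable[[bytes], bytes]] = {
        "lzma-9": lambda data: lzma.compress(data, preset=9),
    }
    try:
        import brotli  # third-party package, optional in this sketch

        compressors["brotli-11"] = lambda data: brotli.compress(data, quality=11)
    except ImportError:
        pass
    return compressors


def compress_smallest(payload: bytes) -> Tuple[str, bytes]:
    """Compress with every available codec and keep the smallest output."""
    results = {name: fn(payload) for name, fn in build_compressors().items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]
```

A loader also needs to know which codec won in order to decompress the artifact; how the PR records that choice is not stated in this summary.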

Novel Contributions

  • SP4096 tokenizer replacing SP1024
  • Warmdown weight decay multiplier of 2.0 to increase compressibility
  • Selecting the smaller of brotli-11 and lzma-9 for artifact compression
  • Achieved a 6-seed mean val_bpb of 1.11349 with 0% pruning and all artifacts under 16 MB
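One plausible reading of the warmdown weight decay multiplier is sketched below: the learning rate decays linearly over the final 4000 iterations, and weight decay is doubled over that same window. The linear shape, the step-based switch, the total step count, and the base weight decay value are all assumptions; the summary gives only the 4000-iteration warmdown and the multiplier of 2.0.

```python
def in_warmdown(step: int, total_steps: int, warmdown_iters: int = 4000) -> bool:
    """True once training has entered the final warmdown window."""
    return step >= total_steps - warmdown_iters


def lr_scale(step: int, total_steps: int, warmdown_iters: int = 4000) -> float:
    """Learning-rate multiplier: 1.0 before warmdown, then linear decay to 0.0.

    The linear shape is an assumption, not stated in the PR summary.
    """
    if not in_warmdown(step, total_steps, warmdown_iters):
        return 1.0
    return (total_steps - step) / warmdown_iters


def weight_decay_at(
    step: int,
    total_steps: int,
    base_wd: float,
    warmdown_iters: int = 4000,
    warmdown_multiplier: float = 2.0,
) -> float:
    """Weight decay for this step, scaled up during warmdown.

    base_wd is a hypothetical value; the summary lists weight_decay as null.
    """
    if in_warmdown(step, total_steps, warmdown_iters):
        return base_wd * warmdown_multiplier
    return base_wd
```

Stronger decay late in training pushes weights toward smaller magnitudes, which is consistent with the stated goal of making the quantized artifact more compressible.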