PR #1508
Status: open
Record: SP4096 + Compressibility Regularization — val_bpb 1.11349 (6-seed mean)
by jpfeiffe
val_bpb: 1.1135
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.68 MB
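For readers unfamiliar with the headline metric, the block below writes out bits per byte as it is usually defined. It assumes val_bpb follows that usual definition: the summed token-level negative log-likelihood on the validation set, converted from nats to bits and divided by the number of raw bytes in that text. The symbols $N_{\text{bytes}}$ and $N_{\text{tokens}}$ are introduced here for illustration only.

```latex
\[
\text{val\_bpb}
  = \frac{1}{N_{\text{bytes}} \,\ln 2}
    \sum_{i=1}^{N_{\text{tokens}}} -\ln p\left(x_i \mid x_{<i}\right)
\]
```

Because the denominator counts bytes rather than tokens, the value stays comparable when the tokenizer changes, for example from SP1024 to SP4096.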
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
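To show what a 6-bit weight grid looks like, here is a minimal sketch of symmetric round-to-nearest quantization. This is deliberately not GPTQ: GPTQ additionally uses second-order statistics from calibration data to compensate for rounding error, and that step is omitted. The function names and the per-row scale are assumptions made for illustration.

```python
BITS = 6
QMAX = 2 ** (BITS - 1) - 1   # 31, largest signed 6-bit code
QMIN = -(2 ** (BITS - 1))    # -32, smallest signed 6-bit code


def quantize_row(weights: list[float]) -> tuple[list[int], float]:
    """Round each weight to the nearest 6-bit code using one scale per row."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Fall back to a scale of 1.0 for an all-zero (or empty) row.
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_row(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


codes, scale = quantize_row([0.5, -1.0, 0.25, 0.0])
restored = dequantize_row(codes, scale)
```

Each stored weight then costs 6 bits plus its share of the row scale, before the brotli or lzma pass listed under Compression.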
Architecture
weight tying
Tied embeddings with the SP4096 tokenizer
parameters: {"vocab_size":4096}
GQA
Grouped query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
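To make the head layout concrete, the sketch below maps each of the 8 query heads to the KV head it reads from. It assumes the common contiguous grouping, where consecutive query heads share one KV head. The PR's actual grouping order is not stated here, and `kv_head_for` is a hypothetical helper name.

```python
HEADS = 8      # query heads, from the parameters above
KV_HEADS = 4   # key/value heads, from the parameters above


def kv_head_for(query_head: int, heads: int = HEADS, kv_heads: int = KV_HEADS) -> int:
    """Return the KV head shared by `query_head` (contiguous grouping assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    if not 0 <= query_head < heads:
        raise ValueError("query_head out of range")
    group_size = heads // kv_heads  # 2 query heads per KV head in this config
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(HEADS)]
```

With 4 KV heads instead of 8, the key and value projections are half the size they would be under standard multi-head attention, which helps when the artifact size is capped.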
XSA
XSA attention used on all layers
parameters: {"layers":11}
BigramHash
BigramHash component (size 3072, 112 dimensions)
parameters: {"dimensions":112,"size":3072}
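A minimal sketch of a hashed bigram lookup, assuming that `size` is the number of hash buckets and `dimensions` is the width of each bucket's embedding. The hash function, its constant, and the helper names are illustrative assumptions rather than the PR's code.

```python
BUCKETS = 3072   # "size" above, assumed to be the number of hash buckets
DIM = 112        # "dimensions" above, assumed to be the embedding width
VOCAB = 4096     # SP4096 vocabulary size


def bigram_bucket(prev_token: int, token: int, buckets: int = BUCKETS) -> int:
    """Map a (previous token, current token) pair to a bucket index.

    The multiplier is an arbitrary odd constant chosen for illustration.
    """
    return ((prev_token * 1_000_003) ^ token) % buckets


def bucket_ids(tokens: list[int]) -> list[int]:
    """Bucket index per position; position 0 has no predecessor and uses bucket 0."""
    if not tokens:
        return []
    ids = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        ids.append(bigram_bucket(prev, cur))
    return ids


table_entries = BUCKETS * DIM   # 344,064 values under these assumptions
dense_pairs = VOCAB * VOCAB     # 16,777,216 distinct token pairs
ids = bucket_ids([5, 17, 17, 4095])
```

Under these assumptions the table holds 344,064 values, whereas a row per distinct token pair would need 16,777,216 rows. Hashing accepts collisions between pairs in exchange for that saving.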
MLP3x
3x MLP with LeakyReLU squared activation
parameters: {"mlp_multiplier":3,"activation":"LeakyReLU"}
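The activation and width can be sketched in a few lines. This assumes "LeakyReLU squared" means applying LeakyReLU and then squaring the result, and it borrows 0.01 (the PyTorch `LeakyReLU` default) as the negative slope. Neither the slope nor the model width is given above, so `negative_slope` and the 512 used in the example are hypothetical.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.01) -> float:
    """LeakyReLU followed by squaring; the slope value is an assumption."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


def mlp_hidden_width(model_dim: int, mlp_multiplier: int = 3) -> int:
    """Hidden width of an MLP whose expansion factor is `mlp_multiplier`."""
    return mlp_multiplier * model_dim


positive = leaky_relu_squared(2.0)    # 2.0 squared
negative = leaky_relu_squared(-2.0)   # (-0.02) squared, a small positive value
hidden = mlp_hidden_width(512)        # 512 is a hypothetical model width
```

Note that under this reading the output is non-negative for every input, since the small negative branch is squared as well.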
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
LR Schedule
warmdown
parameters: {"iters":4000}
Regularization
weight decay
parameters: {"warmdown_multiplier":2}
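The schedule and the weight decay multiplier can be sketched together. This assumes a linear learning-rate decay to zero over the final 4000 iterations and a weight decay that is multiplied by 2 for the whole warmdown phase. The decay shape, the step-wise switch, the total step count, and the base weight decay (listed as null above) are all assumptions, so `total_steps` and `base_wd` are hypothetical inputs.

```python
WARMDOWN_ITERS = 4000          # "iters" from the LR schedule above
WD_WARMDOWN_MULTIPLIER = 2.0   # "warmdown_multiplier" from the parameters above


def lr_scale(step: int, total_steps: int, warmdown_iters: int = WARMDOWN_ITERS) -> float:
    """Learning-rate multiplier: 1.0 until warmdown, then linear decay to 0."""
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return 1.0
    return max(remaining, 0) / warmdown_iters


def weight_decay_at(
    step: int,
    total_steps: int,
    base_wd: float,
    warmdown_iters: int = WARMDOWN_ITERS,
    multiplier: float = WD_WARMDOWN_MULTIPLIER,
) -> float:
    """Weight decay for this step; scaled by `multiplier` once warmdown begins."""
    in_warmdown = (total_steps - step) < warmdown_iters
    return base_wd * multiplier if in_warmdown else base_wd


# Hypothetical run of 10,000 steps with a base weight decay of 0.1.
mid_warmdown_lr = lr_scale(8000, 10_000)
mid_warmdown_wd = weight_decay_at(8000, 10_000, base_wd=0.1)
```

In this sketch the stronger decay only applies near the end of training, when the learning rate is already falling.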
Compression
brotli
level: 11
lzma
level: 9
Novel Contributions
- SP4096 tokenizer replacing SP1024
- Warmdown weight decay multiplier of 2.0 to increase compressibility
- Selecting the smaller of brotli-11 and lzma-9 for artifact compression
- Achieved 6-seed mean val_bpb of 1.11349 with 0% pruning and all artifacts under 16 MB
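The compressor selection in the list above can be sketched as follows: compress the serialized artifact with both codecs and keep whichever output is smaller. The function name `compress_smallest` is hypothetical. `brotli` here is the third-party Brotli package, and its import is made optional so the sketch still runs with only the standard library. How the PR records which codec was chosen is not stated, so that is left out.

```python
import lzma

try:
    import brotli  # third-party "Brotli" package; optional in this sketch
except ImportError:
    brotli = None


def compress_smallest(payload: bytes) -> tuple[str, bytes]:
    """Compress with lzma-9 and, if available, brotli-11; return the smaller result."""
    candidates = {"lzma-9": lzma.compress(payload, preset=9)}
    if brotli is not None:
        candidates["brotli-11"] = brotli.compress(payload, quality=11)
    # Pick the codec whose output has the fewest bytes.
    name = min(candidates, key=lambda k: len(candidates[k]))
    return name, candidates[name]


payload = bytes(range(256)) * 64   # stand-in for a serialized model artifact
name, blob = compress_smallest(payload)
```

Trying both costs one extra compression pass per artifact and can only match or improve on using a single fixed codec.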