val_bpb: 1.4775
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 7.9 MB
Training Techniques
Architecture
BigramHash
Expanded bigram hash embedding table to capture richer local context.
parameters: {"vocab_size":4096}
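A minimal sketch of how a bigram hash embedding lookup can work, assuming the common hash-and-fold scheme; only the table size of 4096 comes from this report, while the mixing constant, embedding width, and BOS handling are illustrative:

```python
import numpy as np

VOCAB_HASH = 4096  # expanded from 2048 per the report
D_MODEL = 64       # illustrative embedding width

rng = np.random.default_rng(0)
bigram_table = rng.normal(size=(VOCAB_HASH, D_MODEL)).astype(np.float32)

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    """Mix adjacent token ids and fold into the table size (hash is a stand-in)."""
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    return h % VOCAB_HASH

def bigram_embed(tokens: list[int]) -> np.ndarray:
    """One hashed-bigram vector per position; position 0 pairs with an assumed BOS id 0."""
    prev = [0] + tokens[:-1]
    idx = [bigram_hash(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]

emb = bigram_embed([5, 17, 42])
```

The hashed-bigram vectors would typically be added to the ordinary unigram token embeddings to inject local two-token context.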
RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
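The 16/64 split can be sketched as partial RoPE that rotates only the first 16 of 64 head dimensions and passes the rest through unchanged; which dimensions rotate, the pairing convention, and the frequency base of 10000 are assumptions:

```python
import numpy as np

HEAD_DIM, ROT_DIM = 64, 16  # the 16/64 split is from the report

def partial_rope(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """x: (seq, HEAD_DIM). Rotate pairs within the first ROT_DIM dims, copy the rest."""
    half = ROT_DIM // 2
    freqs = 10000.0 ** (-np.arange(half) / half)  # (half,)
    ang = pos[:, None] * freqs[None, :]           # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:ROT_DIM]      # NeoX-style half/half pairing
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROT_DIM:]], axis=-1)

seq = 4
x = np.random.default_rng(1).normal(size=(seq, HEAD_DIM))
out = partial_rope(x, np.arange(seq))
```

At position 0 the rotation is the identity, and the unrotated 48 dimensions carry no positional signal at any position.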
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
MLP3x
MLP with a 3x hidden expansion and squared LeakyReLU activation.
parameters: {"activation":"LeakyReLU(0.5)^2"}
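A sketch of the MLP block, assuming "three-times" means a 3x hidden expansion; the weights are random stand-ins, and only the expansion factor and the LeakyReLU(0.5)^2 activation come from the report:

```python
import numpy as np

D = 32  # illustrative model width

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(0.5) followed by squaring, per the report's activation."""
    y = np.where(x >= 0, x, slope * x)
    return y * y  # squaring makes the output non-negative and smooths the kink

rng = np.random.default_rng(2)
W_in = rng.normal(size=(D, 3 * D)) / np.sqrt(D)       # expand to 3x width
W_out = rng.normal(size=(3 * D, D)) / np.sqrt(3 * D)  # project back down

def mlp3x(x: np.ndarray) -> np.ndarray:
    return leaky_relu_sq(x @ W_in) @ W_out

out = mlp3x(rng.normal(size=(5, D)))
```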
Weight Averaging
EMA
parameters: {"schedule":"cosine","start_decay":0.99,"end_decay":0.999}
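The cosine EMA schedule can be sketched as a half-cosine ramp of the decay from 0.99 to 0.999 over training; the endpoints come from the report, while the exact ramp shape and the normalization by total steps are assumptions:

```python
import math

START, END = 0.99, 0.999  # start_decay / end_decay from the report

def ema_decay(step: int, total_steps: int) -> float:
    """Decay follows an assumed half-cosine ramp from START to END."""
    t = min(step / total_steps, 1.0)
    ramp = 0.5 * (1.0 - math.cos(math.pi * t))  # 0 -> 1 as t goes 0 -> 1
    return START + (END - START) * ramp

def ema_update(ema_w: float, w: float, step: int, total: int) -> float:
    d = ema_decay(step, total)
    return d * ema_w + (1.0 - d) * w
```

The low early decay lets the average track the fast-moving weights at the start; the high late decay smooths heavily near convergence.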
SWA
parameters: {"frequency":50}
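A sketch of stochastic weight averaging at the stated frequency, assuming "frequency: 50" means folding the current weights into a running mean every 50 optimizer steps:

```python
FREQ = 50  # averaging frequency from the report

class SWA:
    def __init__(self):
        self.avg = None
        self.n = 0

    def maybe_update(self, step: int, weights: list[float]) -> None:
        if step % FREQ != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental running mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]

swa = SWA()
for step in range(1, 151):
    w = [float(step)]  # stand-in for the model's weights at this step
    swa.maybe_update(step, w)
```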
Quantization
GPTQ-lite int6
bits: 6
scope: all
QAT
bits: 6
scope: all
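Both quantization entries above use symmetric 6-bit integers. A round-to-nearest sketch of the shared fake-quantization step follows; the GPTQ-lite error-compensation pass and the straight-through backward used in QAT are not shown, and per-tensor scaling is an assumption:

```python
import numpy as np

BITS = 6                     # bit width from the report
QMAX = 2 ** (BITS - 1) - 1   # 31 for symmetric int6

def fake_quant(w: np.ndarray) -> np.ndarray:
    """Quantize to int6 and immediately dequantize, as in a QAT forward pass."""
    scale = np.abs(w).max() / QMAX
    return np.round(w / scale).clip(-QMAX - 1, QMAX) * scale

w = np.array([0.013, -0.4, 0.93, -1.0], dtype=np.float32)
w_q = fake_quant(w)
```

During QAT the network trains against exactly this rounding error, so the final int6 export loses little accuracy.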
Compression
lzma
level: 9
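The compression step maps directly onto Python's standard `lzma` module at preset 9; the exact serialization pipeline around it is not specified in the report:

```python
import lzma

# Pack the serialized artifact bytes with LZMA at the report's preset 9
# (highest compression, slowest). The payload here is a stand-in.
data = b"example model artifact bytes " * 100
packed = lzma.compress(data, preset=9)
restored = lzma.decompress(packed)
```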
Other
Late QAT trigger
Lowers the late-QAT activation threshold so quantization-aware training switches on sooner during warmdown.
parameters: {"threshold":0.1}
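A sketch of the earlier late-QAT trigger, assuming the 0.10 threshold (lowered from 0.15 per the contributions list) is compared against the remaining fraction of training steps during warmdown:

```python
THRESHOLD = 0.10  # lowered from 0.15; the quantity it gates is an assumption

def qat_active(step: int, total_steps: int) -> bool:
    """QAT turns on once the remaining fraction of training drops below THRESHOLD."""
    remaining = 1.0 - step / total_steps
    return remaining <= THRESHOLD
```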
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
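The layerwise LN scale can be sketched as multiplying each layer's normalized output by 1/sqrt(layer+1); 0-indexed layers and a parameter-free LayerNorm are assumptions:

```python
import numpy as np

def ln_scale(layer: int) -> float:
    """Per-layer scale 1/sqrt(layer+1) from the report; deeper layers shrink."""
    return 1.0 / np.sqrt(layer + 1)

def scaled_layernorm(x: np.ndarray, layer: int, eps: float = 1e-5) -> np.ndarray:
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return ln_scale(layer) * (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(3).normal(size=(2, 8))
y0 = scaled_layernorm(x, layer=0)
y3 = scaled_layernorm(x, layer=3)
```

Damping deeper layers this way keeps residual-stream contributions roughly balanced across depth.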
Novel Contributions
- Expanded BigramHash vocabulary from 2048 to 4096
- Replaced fixed EMA decay with a cosine EMA schedule from 0.99 to 0.999
- Activated late QAT earlier by lowering the threshold from 0.15 to 0.10
- Increased LZMA compression preset from 6 to 9
- Used ShinkaEvolve with GPT-5.4 and Gemini 3 Pro as mutation operators