PR #681 (open)

Non-record: BigramHash(4096) + Cosine EMA + LZMA-9

by Alfaxad

val_bpb: 1.4775
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 7.9 MB

Training Techniques

Architecture
BigramHash
Expanded bigram hash embedding table to capture richer local context.
parameters: {"vocab_size":4096}
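A bigram-hash embedding buckets each (previous token, current token) pair into a fixed-size table and looks up an extra embedding there; this PR only specifies the table size of 4096 (up from 2048). A minimal sketch of the indexing, where the mixing constant and the sentinel for position 0 are illustrative assumptions:

```python
TABLE_SIZE = 4096  # hash-table size from this PR (expanded from 2048)

def bigram_bucket(prev_tok: int, cur_tok: int, table_size: int = TABLE_SIZE) -> int:
    # Mix the two token ids with a multiplicative hash (constant is an
    # assumption), then fold into the table.
    h = (prev_tok * 0x9E3779B1 + cur_tok) & 0xFFFFFFFF
    return h % table_size

def bigram_indices(tokens):
    # Position 0 has no predecessor; pairing it with a 0 sentinel is an assumption.
    return [bigram_bucket(tokens[i - 1] if i else 0, t) for i, t in enumerate(tokens)]
```

Each resulting index selects a row of a learned (4096 x d) embedding table whose output is added to the regular token embedding.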
RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
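"16/64" reads as rotating only the first 16 of 64 head dimensions and passing the rest through unchanged. A sketch of that partial rotation; the base-10000 frequency ladder and adjacent-pair layout are standard RoPE conventions assumed here, not stated by the PR:

```python
import math

def partial_rope(x, pos, rot_dims=16):
    # x: one attention head of width 64. Rotate only the first rot_dims
    # dimensions; dims rot_dims..63 are left untouched ("16/64").
    out = list(x)
    for i in range(0, rot_dims, 2):
        # Standard RoPE frequency ladder (base 10000 is an assumption).
        theta = pos / (10000.0 ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```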
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
MLP3x
MLP with a 3x hidden-dimension expansion and a squared-LeakyReLU activation.
parameters: {"activation":"LeakyReLU(0.5)^2"}
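Reading "LeakyReLU(0.5)^2" as LeakyReLU with negative slope 0.5 followed by squaring, and "3x" as the hidden-width expansion factor, the block can be sketched as plain matrix-vector code; both readings are assumptions:

```python
def leaky_relu_sq(v: float, slope: float = 0.5) -> float:
    # Squared LeakyReLU: apply LeakyReLU with slope 0.5, then square.
    y = v if v >= 0.0 else slope * v
    return y * y

def mlp3x(x, w_in, w_out):
    # w_in: (3d x d) up-projection, w_out: (d x 3d) down-projection,
    # so the hidden width is 3x the model dim.
    h = [leaky_relu_sq(sum(wi[j] * x[j] for j in range(len(x)))) for wi in w_in]
    return [sum(wo[k] * h[k] for k in range(len(h))) for wo in w_out]
```

Note that squaring makes the activation non-negative even for negative pre-activations, in the spirit of squared-ReLU variants.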
Weight Averaging
EMA
parameters: {"schedule":"cosine","start_decay":0.99,"end_decay":0.999}
SWA
parameters: {"frequency":50}
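The cosine EMA schedule above anneals the decay from 0.99 to 0.999 over training; only those endpoints come from this PR, so the half-cosine ramp shape below is an assumption:

```python
import math

def ema_decay(step, total_steps, start=0.99, end=0.999):
    # Half-cosine ramp from start to end over training; the exact shape
    # is an assumption -- only the 0.99 -> 0.999 endpoints are from this PR.
    frac = min(step / total_steps, 1.0)
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * frac))

def ema_update(ema_params, params, decay):
    # Standard EMA step: ema <- decay * ema + (1 - decay) * param.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

A low decay early lets the average track fast-moving weights; the decay rises toward 0.999 as training settles. SWA with frequency 50 would snapshot weights into a running average every 50 steps alongside this.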
Quantization
GPTQ-lite int6
bits: 6
scope: all
QAT
bits: 6
scope: all
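For QAT, the forward pass typically runs weights through a "fake quantization" that rounds to the integer grid and dequantizes again, so the network learns to tolerate the rounding. A sketch for 6 bits (integer levels in [-32, 31]); symmetric per-tensor scaling is an assumption, only bits=6 and scope=all come from the PR:

```python
def fake_quant_int6(w):
    # Symmetric per-tensor fake quantization to 6 bits.
    amax = max(abs(v) for v in w)
    scale = amax / 31.0 if amax > 0 else 1.0
    q = [max(-32, min(31, round(v / scale))) for v in w]
    # Dequantize back to floats so the rest of the network sees the
    # rounding error it must learn to absorb.
    return [qi * scale for qi in q]
```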
Compression
lzma
level: 9
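The artifact compression step maps directly onto Python's stdlib `lzma` module; preset 9 is the maximum standard level and the setting this PR raises to (from 6, per the contributions list below):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # preset=9 trades compression time for a smaller artifact.
    return lzma.compress(raw, preset=9)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```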
Other
other
Activates the late-QAT phase earlier so the network adapts to quantization sooner during warmdown.
parameters: {"threshold":0.1}
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
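The layerwise LN scale has a direct reading: multiply each layer's normalized output by 1/sqrt(layer+1), damping deeper layers. A minimal sketch; applying the factor as a fixed post-LayerNorm gain is an assumption:

```python
import math

def ln_scale(layer_idx: int) -> float:
    # 1/sqrt(layer+1) from this PR: layer 0 -> 1.0, layer 3 -> 0.5, ...
    return 1.0 / math.sqrt(layer_idx + 1)

def layernorm_scaled(x, layer_idx, eps=1e-5):
    # Plain LayerNorm followed by the fixed layerwise gain.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    g = ln_scale(layer_idx)
    return [g * (v - mu) / math.sqrt(var + eps) for v in x]
```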

Novel Contributions

  • Expanded BigramHash vocabulary from 2048 to 4096
  • Replaced fixed EMA decay with a cosine EMA schedule from 0.99 to 0.999
  • Activated late QAT earlier by lowering the threshold from 0.15 to 0.10
  • Increased LZMA compression preset from 6 to 9
  • Used ShinkaEvolve with GPT-5.4 and Gemini 3 Pro as mutation operators