PR #190

closed

The Stinky Frost Recipe — 1.1725 BPB

by newjordan
val_bpb
1.1725
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.58MB

Training Techniques

Quantization
STE QAT
bits: 6
scope: all weight matrices except embeddings
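A minimal numpy sketch of the forward-pass fake quantization; per-tensor symmetric scaling is an assumption (the scaling scheme is not stated here), and under QAT the backward pass would use the straight-through estimator, treating the rounding as identity:

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Forward-pass fake quantization: round weights onto a symmetric
    6-bit grid. In QAT the straight-through estimator (STE) passes
    gradients through the rounding unchanged."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
wq = fake_quant_int6(w)        # at most 2**6 distinct weight values
```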
Architecture
tied embeddings
Token embeddings are kept in FP16 and tied, preserving token distinguishability under int6 quantization.
parameters: {"fp16":true}
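A sketch of the tied FP16 embedding. The vocabulary size of 256 is an assumption (plausible for a bits-per-byte benchmark but not stated in the PR), and the width of 512 matches the SmearGate parameter count:

```python
import numpy as np

VOCAB, DIM = 256, 512          # illustrative sizes, not from the PR
rng = np.random.default_rng(0)
E = (rng.standard_normal((VOCAB, DIM)) * 0.02).astype(np.float16)

def embed(tokens):
    """Input lookup; E stays FP16 and is excluded from int6 QAT."""
    return E[tokens].astype(np.float32)

def output_logits(h):
    """Tied output head: the same table E is reused as the unembedding,
    so quantization noise cannot collapse two token rows together."""
    return h @ E.astype(np.float32).T
```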
SmearGate
Learned per-dimension gate blending each token embedding with the previous token embedding.
parameters: {"parameters":512}
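A numpy sketch of SmearGate; the sigmoid gating form and leaving the first position un-blended are assumptions, and the 512 parameters suggest one gate per embedding dimension:

```python
import numpy as np

def smear_gate(x, gate_logits):
    """x: (seq, dim) token embeddings. Blend each position with its
    predecessor via a learned per-dimension gate g = sigmoid(gate_logits)."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # (dim,)
    prev = np.vstack([x[:1], x[:-1]])        # position 0 keeps itself (assumption)
    return (1.0 - g) * x + g * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512))
y = smear_gate(x, gate_logits=np.full(512, -20.0))  # gate ~ 0, so y ~ x
```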
BigramHash
Hash-based embedding table for consecutive token pairs to inject bigram context before the first transformer layer.
parameters: {"buckets":4096,"dimension":128}
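A sketch of the BigramHash lookup; the mixing constant and how the 128-dim feature joins the residual stream (projection vs. concatenation) are not specified in the PR, so both are illustrative:

```python
import numpy as np

BUCKETS, DIM = 4096, 128
rng = np.random.default_rng(0)
table = (rng.standard_normal((BUCKETS, DIM)) * 0.02).astype(np.float32)  # learned

def bigram_bucket(prev_tok, tok):
    # illustrative hash; the actual mixing function is not given in the PR
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_features(tokens):
    """One 128-dim feature per position, keyed on (previous token, token);
    position 0 uses (0, tokens[0]) as a stand-in pair (assumption)."""
    pairs = zip([0] + list(tokens[:-1]), tokens)
    idx = np.array([bigram_bucket(p, t) for p, t in pairs])
    return table[idx]

f = bigram_features([10, 42, 42, 7])
```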
KV head count
Grouped-query attention: 4 key/value heads shared across 8 query heads, halving the K/V projection size.
parameters: {"num_heads":8,"num_kv_heads":4}
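A sketch of grouped-query attention with 8 query heads sharing 4 KV heads (two query heads per KV head); causal masking is omitted for brevity:

```python
import numpy as np

def gqa(q, k, v):
    """q: (n_heads, seq, d); k, v: (n_kv, seq, d) with n_heads % n_kv == 0.
    Each group of n_heads // n_kv query heads attends to one shared KV head."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
out = gqa(rng.standard_normal((8, 5, 16)),
          rng.standard_normal((4, 5, 16)),
          rng.standard_normal((4, 5, 16)))
```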
MLP3x
Custom MLP hidden size of 1344 to maximize capacity while fitting within the artifact size limit.
parameters: {"mlp_hidden":1344}
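A quick budget check behind sizing choices like the 1344 hidden width: at 6 bits per weight, the 16 MB cap holds roughly 22M quantized weights before zlib. The layer count and model width are not given in the PR, so this only bounds the total:

```python
# 6-bit weights cost 0.75 bytes each before zlib compression.
budget_bytes = 16 * 2**20          # 16 MB artifact limit
max_params = budget_bytes / 0.75   # ~22.4M int6 weights fit the budget
```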
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
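Muon orthogonalizes the momentum buffer of each 2-D weight with a Newton–Schulz iteration before applying it. The momentum value is not recorded above, so 0.95 and the learning rate here are illustrative; a minimal numpy sketch with decoupled weight decay:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration used by Muon (coefficients from the Muon reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95, wd=0.01):
    """One Muon update; wd matches the card, lr/momentum are assumptions."""
    buf = momentum * buf + grad
    w = (1.0 - lr * wd) * w                 # decoupled weight decay
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))
grad = rng.standard_normal((32, 32))
w2, buf2 = muon_step(w, grad, np.zeros_like(w))
```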
Initialization
OrthoInit
Orthogonal initialization for all large linear layers, with zero-init output projections.
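A sketch of the initialization, assuming the standard QR-based construction for the orthogonal part; output projections are simply zeroed:

```python
import numpy as np

def orthogonal_init(rows, cols, rng):
    """Orthogonal init via QR of a Gaussian matrix (rows >= cols assumed);
    the sign fix makes the orthonormal columns uniformly distributed."""
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
W = orthogonal_init(512, 256, rng)
W_out = np.zeros((256, 512))       # zero-init output projection
```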
Evaluation
sliding window eval
parameters: {"stride":64}
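A sketch of the stride-64 window schedule: every token is scored exactly once, with all but the first window scoring only their trailing tokens so those tokens see near-full left context. How the final partial window is handled is not stated in the PR, so the cutoff below is an assumption:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans: the window covers tokens
    [start, end) but contributes loss only for [score_from, end)."""
    start, scored_to = 0, 0
    spans = []
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans

spans = sliding_windows(2000)
```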
Regularization
weight decay
parameters: {"weight_decay":0.01}
Compression
zlib
level: null
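The quantized artifact is zlib-compressed; the level is not recorded above, so 9 is used for illustration. Int6 codes fit in int8 storage before packing (random codes compress poorly; trained weight codes have lower entropy):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(-32, 32, size=10_000).astype(np.int8)  # stand-in int6 codes
raw = codes.tobytes()
blob = zlib.compress(raw, level=9)      # level not given in the PR
```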
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Int6 quantization with early QAT starting at 25% of training
  • FP16 tied embeddings to preserve token distinguishability under quantization
  • Custom MLP hidden size of 1344 to fit within the 16MB artifact limit
  • SmearGate learned embedding blending with previous-token context
  • BigramHash embedding for direct bigram context before the first transformer layer
  • Orthogonal initialization for large linear layers
  • Muon optimizer with decoupled weight decay
  • Sliding window evaluation with stride 64