PR #190

closed

The Stinky Frost Recipe — 1.1725 BPB

by newjordan
val_bpb
1.1725
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.58MB

Training Techniques

Quantization
STE QAT
bits: 6
scope: all weight matrices except embeddings
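A minimal numpy sketch of the forward-pass fake quantization; per-tensor symmetric scaling is an assumption (the scaling scheme is not stated here), and under QAT the backward pass would use the straight-through estimator, treating the rounding as identity:

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Forward-pass fake quantization: round weights onto a symmetric
    6-bit grid. In QAT the straight-through estimator (STE) passes
    gradients through the rounding unchanged."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
wq = fake_quant_int6(w)        # at most 2**6 distinct weight values
```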
Architecture
tied embeddings
Token embeddings are kept in FP16 and tied, preserving token distinguishability under int6 quantization.
parameters: {"fp16":true}
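A sketch of the tied FP16 embedding. The vocabulary size of 256 is an assumption (plausible for a bits-per-byte benchmark but not stated in the PR), and the width of 512 matches the SmearGate parameter count:

```python
import numpy as np

VOCAB, DIM = 256, 512          # illustrative sizes, not from the PR
rng = np.random.default_rng(0)
E = (rng.standard_normal((VOCAB, DIM)) * 0.02).astype(np.float16)

def embed(tokens):
    """Input lookup; E stays FP16 and is excluded from int6 QAT."""
    return E[tokens].astype(np.float32)

def output_logits(h):
    """Tied output head: the same table E is reused as the unembedding,
    so quantization noise cannot collapse two token rows together."""
    return h @ E.astype(np.float32).T
```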
SmearGate
Learned per-dimension gate blending each token embedding with the previous token embedding.
parameters: {"parameters":512}
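A numpy sketch of SmearGate; the sigmoid gating form and leaving the first position un-blended are assumptions, and the 512 parameters suggest one gate per embedding dimension:

```python
import numpy as np

def smear_gate(x, gate_logits):
    """x: (seq, dim) token embeddings. Blend each position with its
    predecessor via a learned per-dimension gate g = sigmoid(gate_logits)."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # (dim,)
    prev = np.vstack([x[:1], x[:-1]])        # position 0 keeps itself (assumption)
    return (1.0 - g) * x + g * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512))
y = smear_gate(x, gate_logits=np.full(512, -20.0))  # gate ~ 0, so y ~ x
```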
BigramHash
Hash-based embedding table for consecutive token pairs to inject bigram context before the first transformer layer.
parameters: {"buckets":4096,"dimension":128}
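A sketch of the BigramHash lookup; the mixing constant and how the 128-dim feature joins the residual stream (projection vs. concatenation) are not specified in the PR, so both are illustrative:

```python
import numpy as np

BUCKETS, DIM = 4096, 128
rng = np.random.default_rng(0)
table = (rng.standard_normal((BUCKETS, DIM)) * 0.02).astype(np.float32)  # learned

def bigram_bucket(prev_tok, tok):
    # illustrative hash; the actual mixing function is not given in the PR
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_features(tokens):
    """One 128-dim feature per position, keyed on (previous token, token);
    position 0 uses (0, tokens[0]) as a stand-in pair (assumption)."""
    pairs = zip([0] + list(tokens[:-1]), tokens)
    idx = np.array([bigram_bucket(p, t) for p, t in pairs])
    return table[idx]

f = bigram_features([10, 42, 42, 7])
```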
KV head count
Grouped-query attention: 4 key/value heads shared across 8 query heads, halving the K/V projection size.
parameters: {"num_heads":8,"num_kv_heads":4}
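A sketch of grouped-query attention with 8 query heads sharing 4 KV heads (two query heads per KV head); causal masking is omitted for brevity:

```python
import numpy as np

def gqa(q, k, v):
    """q: (n_heads, seq, d); k, v: (n_kv, seq, d) with n_heads % n_kv == 0.
    Each group of n_heads // n_kv query heads attends to one shared KV head."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
out = gqa(rng.standard_normal((8, 5, 16)),
          rng.standard_normal((4, 5, 16)),
          rng.standard_normal((4, 5, 16)))
```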
MLP3x
Custom MLP hidden size of 1344 to maximize capacity while fitting within the artifact size limit.
parameters: {"mlp_hidden":1344}
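A quick budget check behind sizing choices like the 1344 hidden width: at 6 bits per weight, the 16 MB cap holds roughly 22M quantized weights before zlib. The layer count and model width are not given in the PR, so this only bounds the total:

```python
# 6-bit weights cost 0.75 bytes each before zlib compression.
budget_bytes = 16 * 2**20          # 16 MB artifact limit
max_params = budget_bytes / 0.75   # ~22.4M int6 weights fit the budget
```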
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
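Muon orthogonalizes the momentum buffer of each 2-D weight with a Newton–Schulz iteration before applying it. The momentum value is not recorded above, so 0.95 and the learning rate here are illustrative; a minimal numpy sketch with decoupled weight decay:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration used by Muon (coefficients from the Muon reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95, wd=0.01):
    """One Muon update; wd matches the card, lr/momentum are assumptions."""
    buf = momentum * buf + grad
    w = (1.0 - lr * wd) * w                 # decoupled weight decay
    return w - lr * newton_schulz(buf), buf

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))
grad = rng.standard_normal((32, 32))
w2, buf2 = muon_step(w, grad, np.zeros_like(w))
```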
Initialization
OrthoInit
Orthogonal initialization for all large linear layers, with zero-init output projections.
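A sketch of the initialization, assuming the standard QR-based construction for the orthogonal part; output projections are simply zeroed:

```python
import numpy as np

def orthogonal_init(rows, cols, rng):
    """Orthogonal init via QR of a Gaussian matrix (rows >= cols assumed);
    the sign fix makes the orthonormal columns uniformly distributed."""
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
W = orthogonal_init(512, 256, rng)
W_out = np.zeros((256, 512))       # zero-init output projection
```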
Evaluation
sliding window eval
parameters: {"stride":64}
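A sketch of the stride-64 window schedule: every token is scored exactly once, with all but the first window scoring only their trailing tokens so those tokens see near-full left context. How the final partial window is handled is not stated in the PR, so the cutoff below is an assumption:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans: the window covers tokens
    [start, end) but contributes loss only for [score_from, end)."""
    start, scored_to = 0, 0
    spans = []
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans

spans = sliding_windows(2000)
```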
Regularization
weight decay
parameters: {"weight_decay":0.01}
Compression
zlib
level: null
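The quantized artifact is zlib-compressed; the level is not recorded above, so 9 is used for illustration. Int6 codes fit in int8 storage before packing (random codes compress poorly; trained weight codes have lower entropy):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(-32, 32, size=10_000).astype(np.int8)  # stand-in int6 codes
raw = codes.tobytes()
blob = zlib.compress(raw, level=9)      # level not given in the PR
```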
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • Int6 quantization with early QAT starting at 25% of training
  • FP16 tied embeddings to preserve token distinguishability under quantization
  • Custom MLP hidden size of 1344 to fit within the 16MB artifact limit
  • SmearGate learned embedding blending with previous-token context
  • BigramHash embedding for direct bigram context before the first transformer layer
  • Orthogonal initialization for large linear layers
  • Muon optimizer with decoupled weight decay
  • Sliding window evaluation with stride 64