val_bpb: 1.1725
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.58 MB

Training Techniques

Quantization: STE QAT (quantization-aware training with a straight-through estimator)
- bits: 6
- scope: all weight matrices except embeddings

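A minimal sketch of the fake-quant forward pass this implies. The symmetric per-tensor scheme and the function name are assumptions; the straight-through estimator is the backward pass, where autograd would treat the rounding as identity.

```python
import numpy as np

def fake_quant_ste(w: np.ndarray, bits: int = 6) -> np.ndarray:
    """Symmetric per-tensor fake quantization to an int grid.

    Forward: scale, round, clip, rescale. Under QAT the straight-through
    estimator passes gradients through the rounding unchanged.
    """
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                  # values the model trains against
```
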
Architecture: tied embeddings
Token embeddings are kept in FP16 and tied, preserving token distinguishability under int6 quantization.
- parameters: {"fp16": true}

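As a sketch (the sizes here are illustrative, not from the run), tying means one FP16 table serves as both the input lookup and the output logit projection, and that table is excluded from the int6 pass:

```python
import numpy as np

vocab, dim = 256, 64                                   # illustrative sizes
rng = np.random.default_rng(0)
emb = (rng.standard_normal((vocab, dim)) * 0.02).astype(np.float16)

def embed(token_ids: np.ndarray) -> np.ndarray:
    """Input side: row lookup in the shared FP16 table."""
    return emb[token_ids]

def logits(hidden: np.ndarray) -> np.ndarray:
    """Output side: tied projection reuses the same table, transposed."""
    return hidden.astype(np.float32) @ emb.T.astype(np.float32)
```
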
SmearGate
Learned per-dimension gate blending each token embedding with the previous token's embedding.
- parameters: {"parameters": 512}

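A sketch of what such a gate could look like; the exact blend rule is an assumption, and with a 512-dimensional embedding the gate accounts for the 512 parameters listed above.

```python
import numpy as np

def smear_gate(x: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Blend each token embedding with its predecessor, per dimension.

    x: (seq, dim) embeddings; gate_logits: (dim,) learned parameters.
    Position 0 has no predecessor and passes through unchanged.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))   # per-dimension gate in (0, 1)
    out = x.copy()
    out[1:] = (1.0 - g) * x[1:] + g * x[:-1]
    return out
```
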
BigramHash
Hash-based embedding table for consecutive token pairs, injecting bigram context before the first transformer layer.
- parameters: {"buckets": 4096, "dimension": 128}

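A sketch with the bucket count and dimension from above; the pad id and the multiplicative hash are assumptions, and any cheap pair hash into the bucket range would do.

```python
import numpy as np

BUCKETS, BIGRAM_DIM = 4096, 128
rng = np.random.default_rng(0)
bigram_table = (rng.standard_normal((BUCKETS, BIGRAM_DIM)) * 0.02).astype(np.float32)

def bigram_embed(token_ids: np.ndarray) -> np.ndarray:
    """Look up a hashed embedding for each (previous, current) token pair.

    The first position pairs with an assumed pad id of 0. Collisions are
    accepted: 4096 buckets cover far fewer pairs than vocab**2.
    """
    prev = np.concatenate([[0], token_ids[:-1]])
    h = (prev * 1000003 + token_ids) % BUCKETS
    return bigram_table[h]
```
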
KV head count
Grouped-query attention: 8 query heads share 4 key/value heads, shrinking the KV projection parameters and the KV cache.
- parameters: {"num_heads": 8, "num_kv_heads": 4}

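A minimal single-batch sketch of grouped-query attention, where each K/V head serves num_heads // num_kv_heads query heads (causal masking omitted for brevity):

```python
import numpy as np

def grouped_query_attention(q, k, v, num_kv_heads=4):
    """q: (num_heads, seq, hd); k, v: (num_kv_heads, seq, hd).

    Each group of num_heads // num_kv_heads query heads attends to the
    same key/value head; here K/V are materialized by repetition.
    """
    group = q.shape[0] // num_kv_heads
    k = np.repeat(k, group, axis=0)                  # (num_heads, seq, hd)
    v = np.repeat(v, group, axis=0)
    scores = q @ np.swapaxes(k, 1, 2) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```
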
MLP3x
Custom MLP hidden size of 1344, chosen to maximize capacity while fitting within the artifact size limit.
- parameters: {"mlp_hidden": 1344}

Optimizer: Muon
- weight_decay: 0.01
- momentum: null
- other_params: null

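Muon accumulates momentum and then orthogonalizes each matrix update with a Newton-Schulz iteration. A sketch using the simple cubic iteration; the reference implementation uses a tuned quintic, and the lr and beta here are illustrative, not from this run.

```python
import numpy as np

def newton_schulz_orth(g: np.ndarray, steps: int = 10) -> np.ndarray:
    """Approximately orthogonalize g (push its singular values toward 1)."""
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius-normalize so all sigma <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # cubic Newton-Schulz step
    return x

def muon_step(w, grad, mom, lr=0.02, beta=0.95, weight_decay=0.01):
    """One Muon update with decoupled weight decay (matching the 0.01 above)."""
    mom = beta * mom + grad
    w = w * (1.0 - lr * weight_decay) - lr * newton_schulz_orth(mom)
    return w, mom
```
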
Initialization: OrthoInit
Orthogonal initialization for all large linear layers, with zero-initialized output projections.

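A sketch of the two pieces: orthogonal init via QR of a Gaussian matrix, and zero-initialized output projections so each residual block initially contributes nothing (the 512-wide shape is illustrative).

```python
import numpy as np

def orthogonal_init(rows: int, cols: int, rng=None) -> np.ndarray:
    """Orthogonal init: QR of a Gaussian matrix, sign-corrected."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # remove the QR sign ambiguity
    return q if rows >= cols else q.T

# Output projections start at zero, so residual blocks begin as identity maps.
w_out = np.zeros((512, 512))
```
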
Evaluation: sliding window eval
- parameters: {"stride": 64}

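Stride-64 sliding-window evaluation scores each token with near-full left context: windows advance by the stride, and only the newly covered tokens are scored. A sketch of the span schedule, assuming the window matches the 1024-token training length:

```python
def sliding_window_spans(n_tokens: int, window: int = 1024, stride: int = 64):
    """Return (begin, end, score_from) triples.

    Each window covers tokens [begin, end) but only [score_from, end) is
    scored, so every token is scored exactly once while later tokens see
    up to window - stride tokens of context.
    """
    spans, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_to))
        scored_to = end
        if end == n_tokens:
            break
    return spans
```
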
Regularization: weight decay
- parameters: {"weight_decay": 0.01}

Compression: zlib
- level: null

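Presumably the serialized quantized weights are zlib-compressed to reach the 15.58 MB artifact. A sketch; the int8 container format is an assumption, and level 9 is illustrative since the level above is null.

```python
import zlib
import numpy as np

# int6 values held in int8 containers, then zlib-compressed for the artifact
rng = np.random.default_rng(0)
weights = np.clip(np.round(rng.standard_normal(10_000) * 8), -32, 31).astype(np.int8)
blob = zlib.compress(weights.tobytes(), level=9)   # level is an assumption
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
```

The narrow int6 value range leaves redundancy for the entropy coder even when the values themselves look noise-like.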
Sequence Length
- train_length: 1024
- eval_length: null

Novel Contributions
- Int6 quantization with early QAT starting at 25% of training
- FP16 tied embeddings to preserve token distinguishability under quantization
- Custom MLP hidden size of 1344 to fit within the 16MB artifact limit
- SmearGate learned embedding blending with previous-token context
- BigramHash embedding for direct bigram context before the first transformer layer
- Orthogonal initialization for large linear layers
- Muon optimizer with decoupled weight decay
- Sliding window evaluation with stride 64