val_bpb: 1.1510
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.1 MB
Training Techniques

Architecture

- BigramHash: hashes each (previous token, current token) pair into a learned bigram embedding table; the looked-up vector is added to the token embedding. Parameters: {"bigram_vocab": 10240, "bigram_dim": 128}
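The card gives only the table sizes, but the BigramHash idea can be sketched roughly as follows. The hash mixing constant, the position-0 handling, and all names are illustrative assumptions; only `bigram_vocab` and `bigram_dim` come from the parameters above.

```python
import numpy as np

BIGRAM_VOCAB = 10240   # size of the hashed bigram table (from the card)
BIGRAM_DIM = 128       # width of each bigram embedding (from the card)

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, size=(BIGRAM_VOCAB, BIGRAM_DIM))

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    """Mix a token pair into one bucket; the odd multiplier is arbitrary."""
    return (prev_tok * 1000003 + cur_tok) % BIGRAM_VOCAB

def bigram_embed(tokens):
    """Return one hashed-bigram vector per position.

    Position 0 is paired with an assumed BOS id of 0; the result would be
    added to the ordinary token embeddings."""
    out = np.zeros((len(tokens), BIGRAM_DIM))
    prev = 0
    for i, t in enumerate(tokens):
        out[i] = bigram_table[bigram_hash(prev, t)]
        prev = t
    return out

emb = bigram_embed([5, 17, 5, 17])
```

Note that hashing makes the table size independent of the true bigram vocabulary: distinct pairs may collide, trading a little accuracy for a fixed parameter budget.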
- SmearGate: a learned per-dimension sigmoid gate blends each token embedding with the previous token's embedding. Parameters: none.
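A minimal sketch of what such a gate could look like. The convex blend form and the choice to leave the first position unchanged are guesses; the card only states that a per-dimension sigmoid gate mixes in the previous embedding.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x, gate_logits):
    """x: (seq, dim) token embeddings; gate_logits: (dim,) learned parameters.

    Blends each position with its predecessor, per dimension; the first
    position has no predecessor and is left as-is (an assumption)."""
    g = sigmoid(gate_logits)                  # per-dimension gate in (0, 1)
    out = x.copy()
    out[1:] = (1.0 - g) * x[1:] + g * x[:-1]
    return out

x = np.arange(8, dtype=float).reshape(4, 2)   # toy (seq=4, dim=2) input
y = smear_gate(x, np.zeros(2))                # zero logits -> gate of 0.5
```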
- MLP3x: wider feed-forward network with 3x the standard MLP width. Parameters: {"mlp_mult": 3}
- GQA: grouped-query attention with fewer KV heads than query heads. Parameters: {"num_heads": 8, "num_kv_heads": 4}
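A shape-level sketch of grouped-query attention with the head counts above (8 query heads sharing 4 KV heads, so 2 query heads per KV head). Causal masking and the linear projections are omitted; the head dimension and all array shapes here are illustrative.

```python
import numpy as np

def gqa(q, k, v, num_heads=8, num_kv_heads=4):
    """q: (seq, num_heads, hd); k, v: (seq, num_kv_heads, hd).

    Each KV head is repeated for its group of query heads, then standard
    scaled dot-product attention runs per head. No causal mask, for brevity."""
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=1)           # (seq, num_heads, hd)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', weights, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8, 16))
k = rng.normal(size=(5, 4, 16))
v = rng.normal(size=(5, 4, 16))
out = gqa(q, k, v)
```

The payoff is a smaller KV cache and fewer KV projection parameters, since K and V are stored for 4 heads instead of 8.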
- RoPE: rotary positional embeddings. Parameters: {"rope_base": 10000}
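A compact numpy sketch of rotary embeddings with base 10000, as listed above. This uses the split-halves channel pairing, which is one of the two common conventions; the card does not say which one the submission uses.

```python
import numpy as np

def rope(x, base=10000.0):
    """x: (seq, dim) with even dim. Rotates channel pair (i, i + dim/2) at
    position p by angle p * base**(-i / (dim/2))."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    angles = np.outer(np.arange(seq), freqs)    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(1).normal(size=(4, 8))
y = rope(x)
```

Because each channel pair is rotated, RoPE preserves vector norms and leaves position 0 untouched, which the assertions below check.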
- U-Net skip connections: skip connections between corresponding blocks in the first and second halves of the block stack, U-Net style. Parameters: none.
Quantization

- STE QAT: quantization-aware training that uses a straight-through estimator to pass gradients through the rounding step. Bits: 6. Scope: all large weight matrices.
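The forward-pass quantizer for 6-bit symmetric fake quantization can be sketched as below. In QAT the straight-through estimator makes the backward pass treat the rounding as identity, which plain numpy cannot express, so only the forward quantize/dequantize step is shown; the per-tensor absmax scaling is an assumption.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Quantize w to symmetric int levels, then dequantize back to float.

    With bits=6 the levels run from -32 to 31, so the round-trip error per
    element is at most half a quantization step."""
    qmax = 2 ** (bits - 1) - 1                # 31 for 6 bits
    scale = np.abs(w).max() / qmax            # per-tensor absmax scale (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.31, -0.8, 0.05, 0.62])
wq = fake_quant(w)
```

Training against `wq` in the forward pass lets the network adapt to the 6-bit grid, so the shipped artifact can store the integer codes losslessly.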
Compression

- zstd: compression level 22.
Weight Averaging

- SWA: stochastic weight averaging over checkpoints from the final training steps. Parameters: {"final_steps": 600, "snapshot_interval": 50}
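With the parameters above this amounts to averaging roughly a dozen snapshots taken every 50 steps over the last 600 steps. A toy sketch of the averaging step; the parameter-dict checkpoint format is an illustrative stand-in.

```python
import numpy as np

def average_snapshots(snapshots):
    """snapshots: list of dicts mapping parameter name -> array.

    Returns the element-wise mean of each parameter across snapshots."""
    n = len(snapshots)
    return {name: sum(s[name] for s in snapshots) / n for name in snapshots[0]}

# Toy example: three snapshots of a single 2-element parameter.
snaps = [{"w": np.array([1.0, 2.0])},
         {"w": np.array([3.0, 4.0])},
         {"w": np.array([5.0, 6.0])}]
avg = average_snapshots(snaps)
```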
Evaluation

- Sliding window eval: evaluation over overlapping windows so most tokens are scored with near-full left context. Parameters: {"stride": 64}
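With stride 64, sliding-window evaluation re-runs the model on overlapping windows and scores only the tokens not covered by an earlier window. A sketch of the span bookkeeping, assuming the window equals the 1024-token training length (the card does not state the eval window explicitly).

```python
def window_spans(n_tokens, window=1024, stride=64):
    """Yield (begin, end, n_scored) spans covering n_tokens tokens.

    Each window of length `window` advances by `stride`; only the tokens past
    the previous window's end are scored, so every token is scored once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = window_spans(2048)
```

The cost is one forward pass per stride rather than per window, which is why a small stride like 64 is affordable only at evaluation time.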
Regularization

- Logit softcap: squashes output logits through a scaled tanh so their magnitude never exceeds the cap. Parameters: {"value": 30}
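The standard softcap form with the cap of 30 listed above: near-identity for small logits, smoothly saturating at the cap.

```python
import numpy as np

def softcap(logits, cap=30.0):
    """cap * tanh(logits / cap): bounded to (-cap, cap), ~identity near zero."""
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -5.0, 0.0, 5.0, 100.0])
y = softcap(x)
```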
Initialization

- OrthoInit: orthogonal weight initialization.
Optimizer

- Muon. Weight decay: 0.04. Momentum: null. Muon is applied to the matrix parameters; embeddings and scalars are optimized with Adam.
LR Schedule

- Warmdown: warmup followed by a late linear warmdown. Parameters: {"warmup_steps": 20, "warmdown_steps": 1200}
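One common shape consistent with the parameters above: linear warmup over the first 20 steps, a flat middle, then linear decay to zero over the final 1200 steps. The total step count and the exact interpolation are assumptions for illustration.

```python
def lr_at(step, total_steps=3000, warmup_steps=20, warmdown_steps=1200, base_lr=1.0):
    """Piecewise-linear schedule: ramp up, hold, then ramp down to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return base_lr * (total_steps - step) / warmdown_steps
    return base_lr
```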
Sequence Length

- train_length: 1024, eval_length: null
Novel Contributions
- BigramHash embedding
- SmearGate
- Int6 QAT with STE
- zstd-22 artifact compression
- SWA over final checkpoints
- sliding window evaluation