val_bpb: 1.1487
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.9 MB
Training Techniques
Architecture: MLP3x
10-layer transformer whose MLP uses a relu² activation and an inner dimension expanded to 3x the hidden size.
parameters: {"layers":10,"hidden":1536}
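A minimal numpy sketch of the relu² MLP block described above. The weight shapes follow the stated parameters (hidden 1536, inner dimension 3×1536 = 4608); the initialization scale is an illustrative assumption, not the model's actual init.

```python
import numpy as np

def relu2_mlp(x, w_in, w_out):
    """relu^2 MLP: expand to 3x hidden, apply max(0, a)^2, project back."""
    a = x @ w_in                  # (seq, 3 * hidden)
    a = np.maximum(a, 0.0) ** 2   # relu^2 activation
    return a @ w_out              # (seq, hidden)

hidden = 1536
rng = np.random.default_rng(0)
w_in = rng.standard_normal((hidden, 3 * hidden)) * 0.02   # toy init scale
w_out = rng.standard_normal((3 * hidden, hidden)) * 0.02
x = rng.standard_normal((4, hidden))
y = relu2_mlp(x, w_in, w_out)
```

relu² keeps the cheap sparsity of relu while giving a smooth, steeper response for large pre-activations.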
BigramHash
Adds hashed bigram embedding features to inject local n-gram information into the token stream.
parameters: {"vocab":2048,"dim":128}
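A sketch of the hashed-bigram lookup under the stated parameters (2048 buckets, dim 128). The specific hash function and how the 128-dim feature is folded into the 1536-dim residual stream (e.g. via a learned projection) are assumptions.

```python
import numpy as np

N_BUCKETS, DIM = 2048, 128  # from the parameters above

def bigram_bucket(prev_tok, cur_tok):
    # Multiplicative hash of the (prev, cur) token pair into one of
    # N_BUCKETS bins; the real hash function is an assumption.
    return (prev_tok * 1000003 + cur_tok) % N_BUCKETS

def bigram_features(tokens, table):
    feats = np.zeros((len(tokens), DIM))
    for i in range(1, len(tokens)):  # position 0 has no previous token
        feats[i] = table[bigram_bucket(tokens[i - 1], tokens[i])]
    return feats

rng = np.random.default_rng(0)
table = rng.standard_normal((N_BUCKETS, DIM)) * 0.02  # learned in practice
tokens = [5, 17, 5, 17]
f = bigram_features(tokens, table)
```

Hashing makes the table size independent of the vocabulary: distinct bigrams may collide, but the model learns embeddings that work for the colliding set.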
SmearGate
A learnable gate that blends each token's representation with the previous token's.
parameters: null
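One way the gate could work, as a numpy sketch: a per-position scalar gate computed from the current representation decides how much of the previous position to mix in. The exact gate parameterization is an assumption since no parameters are recorded.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w_gate):
    """Blend each position with the previous position via a learned gate."""
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                      # no previous token at position 0
    g = sigmoid(x @ w_gate)            # (seq, 1): one gate per position
    return (1.0 - g) * x + g * prev

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 16))
w_gate = rng.standard_normal((16, 1)) * 0.1
y = smear_gate(x, w_gate)
```

This is a cheap, learnable alternative to a fixed "token smearing" convolution: the model decides per position how much previous-token signal to pull in.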
Weight tying
Input and output embeddings are tied.
parameters: null
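Weight tying means one matrix serves as both the input embedding table and the output projection, as in this toy sketch (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 50, 16
emb = rng.standard_normal((vocab, hidden)) * 0.02  # the single shared matrix

tokens = np.array([3, 7, 7])
h = emb[tokens]        # input side: embedding lookup
logits = h @ emb.T     # output side: projection reuses the same matrix
```

Tying roughly halves the embedding parameter count, which matters for a 14.9 MB artifact.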
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
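A numpy sketch of grouped-query attention with the stated 8 query heads and 4 KV heads, so each KV head is shared by 2 query heads. The causal mask is omitted for brevity, and the head dimension is illustrative.

```python
import numpy as np

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)   # share each KV head across its group
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
seq, d = 5, 32
q = rng.standard_normal((8, seq, d))
k = rng.standard_normal((4, seq, d))
v = rng.standard_normal((4, seq, d))
out = gqa(q, k, v)
```

Halving the KV heads halves the KV projection parameters and the KV cache, usually at little quality cost.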
Initialization: OrthoInit
Orthogonal initialization with scaled projections.
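A standard recipe for scaled orthogonal initialization, sketched below: QR-decompose a Gaussian matrix and scale by a gain. The gain value and sign-correction details are conventional choices, not confirmed specifics of OrthoInit.

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    """Orthogonal init: QR-decompose a Gaussian matrix, scale by `gain`."""
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # fix QR sign ambiguity for uniformity
    if rows < cols:
        q = q.T
    return gain * q

w = ortho_init((64, 64), gain=0.5, rng=np.random.default_rng(0))
```

Orthogonal matrices preserve the norm of activations through a projection, which helps keep early-training signal propagation stable.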
Quantization: mixed int5/int6
scope: MLP int5, attention int6, embeddings fp16
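A sketch of symmetric per-tensor quantization at the stated bit widths. The per-tensor, symmetric scheme is an assumption; the source only specifies which tensors get 5 vs 6 bits.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w_mlp = rng.standard_normal((256, 256)).astype(np.float32)
q5, s5 = quantize_symmetric(w_mlp, bits=5)   # MLP weights -> int5
w_hat = dequantize(q5, s5)                   # reconstruction for inference
```

Attention weights would use `bits=6`, embeddings stay fp16, and the integer codes would then be bit-packed and zstd-compressed to produce the final artifact.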
Compression: zstd (level 22)
Weight Averaging: SWA
parameters: {"checkpoints_averaged":24}
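Stochastic weight averaging here reduces to a uniform mean over the parameter tensors of 24 late-training checkpoints, as in this toy sketch (the checkpoint contents are stand-ins):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniformly average parameter dicts from several checkpoints."""
    avg = {}
    for name in checkpoints[0]:
        avg[name] = np.mean([c[name] for c in checkpoints], axis=0)
    return avg

# Toy stand-in for 24 late-training checkpoints of one weight matrix.
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 4))
ckpts = [{"w": base + 0.01 * rng.standard_normal((4, 4))} for _ in range(24)]
swa = average_checkpoints(ckpts)
```

Averaging late checkpoints smooths the noise of the final optimization steps and typically lands in a flatter, better-generalizing region of the loss surface.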
Evaluation: sliding-window eval
parameters: {"stride":32,"context_length":2048}
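With stride 32 and context 2048, evaluation advances 32 tokens at a time and scores only the newly exposed tokens, so almost every token is conditioned on near-full context. A sketch of the window schedule (the exact edge handling is an assumption):

```python
def sliding_windows(n_tokens, context=2048, stride=32):
    """Yield (start, end, score_from): score tokens [score_from, end)
    while the model conditions on tokens [start, end)."""
    windows = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        windows.append((start, end, pos))
        pos = end
    return windows

ws = sliding_windows(5000)
scored = sum(end - score_from for _, end, score_from in ws)
```

This is far more expensive than disjoint 2048-token chunks (one forward pass per 32 tokens instead of per 2048), but it removes the pessimistic bias of scoring tokens with truncated context.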
Optimizer: Muon
weight_decay: 0.04
momentum: null
other_params: AdamW is used for embedding and scalar parameters.
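Muon's core step orthogonalizes the momentum buffer of each 2-D weight matrix before applying it, usually via a quintic Newton-Schulz iteration. A sketch of that orthogonalization step; the coefficients are the commonly published ones and are an assumption here, as is the step count.

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize g (push its singular values toward 1)
    with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly used quintic coefficients
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize by Frobenius norm
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x

update = newton_schulz_orth(np.eye(8))  # orthogonal input stays ~orthogonal
```

Replacing the raw momentum with its orthogonalized version equalizes the update's singular values, which is the source of Muon's fast matrix-parameter convergence; non-matrix parameters (embeddings, scalars) fall back to AdamW as noted above.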
Sequence Length
train_length: 2048
eval_length: 2048
Regularization: weight decay
parameters: {"value":0.04}
Novel Contributions
- 10-layer relu² MLP3x transformer
- BigramHash(2048) with SmearGate
- Orthogonal initialization
- Mixed int5/int6 quantization with zstd-22 compression
- SWA averaging over late checkpoints
- Stride-32 dense sliding-window evaluation