val_bpb: 1.5546
Architecture: Transformer
Optimizer: Muon
Artifact Size: 10030236 bytes (~10.0 MB)
Training Techniques
- Quantization: QAT (bits: 6, scope: all)
- Weight Averaging: EMA (parameters not recorded)
- Evaluation: sliding-window eval (parameters not recorded)
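The int6 QAT and EMA techniques above can be sketched as follows. This is a minimal illustration, assuming symmetric per-tensor quantization and a placeholder EMA decay; the run's actual scheme and hyperparameters are not recorded:

```python
import numpy as np

def fake_quant(w, bits=6):
    # Fake quantization for QAT: round weights to a symmetric int6 grid,
    # then dequantize, so the forward pass sees quantized precision while
    # the master weights stay in float. Symmetric per-tensor scaling is an
    # assumption here, not a recorded detail of the run.
    qmax = 2 ** (bits - 1) - 1                      # 31 levels per side for 6 bits
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def ema_update(ema_w, w, decay=0.999):
    # EMA weight averaging: shadow weights track a moving average of the
    # training weights; the decay value is a placeholder.
    return decay * ema_w + (1.0 - decay) * w
```

During training, gradients would typically flow through `fake_quant` via a straight-through estimator, and evaluation would use the EMA shadow weights rather than the raw training weights.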
Optimizer
- Muon (weight_decay, momentum, and other parameters not recorded)
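Muon's distinctive step is orthogonalizing the momentum buffer of each 2-D weight matrix with a Newton-Schulz iteration. A minimal sketch, assuming the quintic coefficients from the public Muon reference implementation; the learning rate and momentum here are placeholders, since the run's values are not recorded:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Approximately orthogonalize G (push its singular values toward 1)
    # using the quintic Newton-Schulz iteration used by Muon.
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the reference impl
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    # Momentum SGD whose update direction is orthogonalized per matrix.
    buf = momentum * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf
```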
Architecture
- Weight tying: tied input embedding and output head weights
- GQA: grouped-query attention with fewer KV heads than query heads (num_heads: 8, num_kv_heads: 4)
- MLP3x: MLP width expanded to 3x
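With num_heads = 8 and num_kv_heads = 4, each K/V head is shared by two query heads. A minimal non-causal sketch of grouped-query attention (masking and batching omitted for brevity):

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (T, num_heads, d); k, v: (T, num_kv_heads, d).
    # Each KV head is broadcast to num_heads // num_kv_heads query heads.
    group = num_heads // num_kv_heads
    k = np.repeat(k, group, axis=1)       # (T, num_heads, d)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over key positions
    return np.einsum('hts,shd->thd', w, v)
```

The payoff is a smaller KV cache: halving the KV heads halves the cached K/V tensors at inference time while keeping all 8 query heads.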
Sequence Length
- train_length: 2048
- eval_length: 2048
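Since both train and eval context are 2048 tokens, text longer than the context is scored with the sliding-window evaluation listed above: the window advances by a stride, and only the tokens past the overlap are counted, so each token is scored once with up to a full window of left context. A sketch, assuming a hypothetical `token_nll` scorer and a placeholder stride:

```python
def sliding_window_nll(token_nll, tokens, window=2048, stride=512):
    # token_nll(chunk) -> per-token NLLs for that chunk (hypothetical scorer).
    # The first window scores all its tokens; later windows score only the
    # tokens past the overlap, so every token is counted exactly once.
    total, count, start = 0.0, 0, 0
    while True:
        end = min(start + window, len(tokens))
        nlls = token_nll(tokens[start:end])
        new_from = 0 if start == 0 else window - stride
        total += sum(nlls[new_from:])
        count += len(nlls) - new_from
        if end == len(tokens):
            return total / count
        start += stride
```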
Compression
- zstd (level not recorded)
Other
- EMA evaluation/serialization with int8-style artifact export and roundtrip validation
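The int8-style export with roundtrip validation can be sketched as a per-tensor scale plus a raw int8 payload, with the roundtrip check bounding reconstruction error by half a quantization step. The format details here are assumptions for illustration, not the run's actual serializer; in the real artifact the payload would additionally be zstd-compressed:

```python
import numpy as np

def export_int8(w):
    # Hypothetical int8-style export: per-tensor symmetric scale + int8 bytes.
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return scale, q.tobytes(), w.shape

def load_int8(scale, payload, shape):
    # Inverse of export_int8: decode the bytes and rescale to float.
    return np.frombuffer(payload, dtype=np.int8).reshape(shape).astype(np.float32) * scale

def roundtrip_ok(w):
    # Roundtrip validation: error should be at most half a quant step.
    scale, payload, shape = export_int8(w)
    w2 = load_int8(scale, payload, shape)
    return float(np.abs(w - w2).max()) <= 0.5 * scale + 1e-9
```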
Novel Contributions
- 26.5M-parameter GPT variant
- int6 QAT
- EMA evaluation and serialization
- sliding-window validation
- Muon optimizer with weight decay tuning
- 2048-token context
- 12-layer, 3x MLP architecture
- zstd-compressed int8-style artifact export