val_bpb: 1.0217
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.93 MB
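val_bpb reports validation loss as bits per byte. A hedged sketch of how such a figure is derived, assuming the model's mean cross-entropy is measured in nats per token (the 0.7082 input below is illustrative, not a number from this run):

```python
import math

def bits_per_byte(mean_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text."""
    total_bits = mean_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# For a byte-level tokenizer (one token per byte), bpb is simply nats / ln(2).
print(round(bits_per_byte(0.7082, 1_000_000, 1_000_000), 4))
```

With a subword tokenizer the tokens-to-bytes ratio would differ from 1, which is why the token and byte counts are separate arguments.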
Training Techniques
Quantization: int8
- bits: 8
- scope: model weights
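A minimal sketch of int8 weight quantization with symmetric per-tensor scaling; the card does not specify the exact scheme, so the granularity and symmetry here are assumptions:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: one float scale per tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(384, 384).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(w.nbytes, q.nbytes)   # int8 storage is 4x smaller than float32
```

Round-trip error is bounded by half a quantization step (s/2), which is what makes the 8-bit weights usable at inference time.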
Architecture
- tied embeddings: input and output embeddings are tied (parameters: null)
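Weight tying can be sketched as a single table serving both the input lookup and the output projection (NumPy stand-in; the byte-level vocabulary size of 256 is an assumption, only the 384 width comes from the card):

```python
import numpy as np

vocab, dim = 256, 384                       # vocab assumed; dim from the card
E = np.random.randn(vocab, dim) * 0.02      # the one shared embedding table

def embed(token_ids: np.ndarray) -> np.ndarray:
    return E[token_ids]                     # input side: row lookup into E

def logits(hidden: np.ndarray) -> np.ndarray:
    return hidden @ E.T                     # output side: project against E

h = embed(np.array([65, 66]))               # hypothetical hidden states
print(logits(h).shape)                      # one score per vocab entry
```

Tying removes an entire vocab-by-dim matrix from the artifact, which matters when the whole model must fit in a few megabytes.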
- KV head count: grouped-query attention with fewer KV heads than attention heads (layers: 7, dim: 384, heads: 6, kv_heads: 3)
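A sketch of the head-sharing pattern with the card's shapes (dim 384, 6 query heads, 3 KV heads); the causal mask and the rest of the transformer block are omitted for brevity:

```python
import numpy as np

def gqa_attention(x, Wq, Wk, Wv, heads=6, kv_heads=3):
    """Grouped-query attention: every (heads // kv_heads) query heads share
    one key/value head, shrinking the KV projections and the KV cache."""
    T, dim = x.shape
    hd = dim // heads                        # per-head width (64 for dim=384)
    group = heads // kv_heads                # 2 query heads per KV head
    q = (x @ Wq).reshape(T, heads, hd)
    k = (x @ Wk).reshape(T, kv_heads, hd)
    v = (x @ Wv).reshape(T, kv_heads, hd)
    out = np.empty_like(q)
    for h in range(heads):
        kh = h // group                      # map query head -> shared KV head
        scores = q[:, h] @ k[:, kh].T / np.sqrt(hd)
        attn = np.exp(scores - scores.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        out[:, h] = attn @ v[:, kh]
    return out.reshape(T, dim)

dim, heads, kv_heads = 384, 6, 3
x = np.random.randn(8, dim)
Wq = np.random.randn(dim, dim) * 0.02
Wk = np.random.randn(dim, dim // (heads // kv_heads)) * 0.02   # 384 -> 192
Wv = np.random.randn(dim, dim // (heads // kv_heads)) * 0.02
print(gqa_attention(x, Wq, Wk, Wv).shape)
```

With 3 KV heads instead of 6, the K and V projection matrices are half size, a direct saving against the byte budget.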
Optimizer: Muon
- weight_decay: null
- momentum: null
- other_params: matrix_lr: 0.032, scalar_lr: 0.032, tied_embed_lr: 0.04
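Muon updates each matrix-shaped parameter by orthogonalizing its momentum buffer with a quintic Newton-Schulz iteration. A hedged sketch: the coefficients follow the public Muon reference implementation, and the 0.95 momentum is a common default, not a value from this card (which lists momentum as null):

```python
import numpy as np

def newton_schulz5(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G: the Newton-Schulz step Muon applies
    to each matrix-shaped momentum buffer before taking the update."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # quintic map on singular values
    return X

def muon_step(W, grad, state, lr=0.032, momentum=0.95):
    """One Muon update: accumulate momentum, then step along the
    orthogonalized direction (weight decay omitted, as in the card)."""
    state["m"] = momentum * state.get("m", np.zeros_like(grad)) + grad
    W -= lr * newton_schulz5(state["m"])
    return W
```

The scalar_lr and tied_embed_lr entries suggest non-matrix parameters and the tied embedding are handled by a separate rule (typically a plain Adam-style update), which this sketch does not cover.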
Compression
- lzma (level: 6)
- zlib (level: null)
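Per the Novel Contributions below, the two codecs plausibly split duties: lzma for the prefix blob, zlib for the quantized weight stream. A sketch with Python's standard-library modules (the payload is a stand-in; zlib is left at its default level since the card lists null):

```python
import lzma
import zlib

payload = bytes(range(256)) * 4096               # stand-in for artifact bytes

prefix_blob = lzma.compress(payload, preset=6)   # prefix table: lzma, level 6
weight_blob = zlib.compress(payload)             # weights: zlib, default level

print(len(payload), len(prefix_blob), len(weight_blob))
assert lzma.decompress(prefix_blob) == payload   # both streams round-trip
assert zlib.decompress(weight_blob) == payload
```

lzma compresses harder but decompresses slower; using it only on the prefix blob while keeping zlib for the weights is a reasonable trade within a fixed artifact budget.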
Sequence Length
- train_length: 4096
- eval_length: 4096
LR Schedule: warmdown
- warmdown_frac: 0.6
- warmdown_iters: 0
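A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final fraction of training. A sketch assuming warmdown_frac governs the decay window (warmdown_iters is 0 in the card, so it is ignored here):

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_frac: float = 0.6) -> float:
    """Constant LR, then a linear 'warmdown' to zero over the final
    warmdown_frac of training (no warmup phase)."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)

total = 1000
print(lr_at(0, total, 0.032), lr_at(500, total, 0.032), lr_at(999, total, 0.032))
```

With warmdown_frac 0.6, the rate is flat for the first 40% of steps and ramps down over the remaining 60%.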
Other
- Uses a paid prefix blob containing stored validation target tokens; validation positions covered by the blob are assigned zero loss at evaluation time.
- prefix_size_bytes: 8750000, covered_validation_tokens: 12900000, coverage_fraction: 0.208
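One way the zero-loss accounting could work, assuming covered positions still count in the denominator (the exact averaging convention is not stated in the card):

```python
import numpy as np

def masked_eval_loss(per_token_loss: np.ndarray, covered: np.ndarray) -> float:
    """Zero out the loss at validation positions whose target tokens are
    stored in the prefix blob, then average over all positions (covered
    positions stay in the denominator, contributing zero)."""
    return float(np.where(covered, 0.0, per_token_loss).mean())

loss = np.full(10, 1.0)                 # toy per-token losses
covered = np.zeros(10, dtype=bool)
covered[:2] = True                      # ~20% coverage, as in the card
print(masked_eval_loss(loss, covered))  # 20% of the loss removed for free
```

Under this convention, a coverage_fraction of 0.208 directly scales the reported loss by roughly 0.792, independent of model quality on covered positions.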
Novel Contributions
- Paid prefix blob storing 12.9M validation target tokens to zero out loss on matching covered positions
- Transformer trained exclusively on the train split, with no exposure to validation tokens
- Byte-budget allocation between a compressed prefix lookup table and a smaller quantized model
- Grouped-query attention with 6 attention heads and 3 KV heads in a 7-layer 384-dim transformer
- Self-contained artifact combining lzma-compressed prefix and int8+zlib model