val_bpb: 1.1779
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,981,108 bytes
Training Techniques
Architecture
- tied embeddings: input and output embeddings share one weight matrix.
- KV head count: grouped-query attention with fewer KV heads than query heads (num_heads: 8, num_kv_heads: 4).
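With 8 query heads over 4 KV heads, each KV head serves a group of 2 query heads. A minimal NumPy sketch of that head sharing (the function name and shapes are illustrative, not from the run's code):

```python
import numpy as np

def repeat_kv(kv, num_heads, num_kv_heads):
    """Expand KV heads so each group of query heads shares one KV head.

    kv: array of shape (num_kv_heads, seq_len, head_dim)
    returns: array of shape (num_heads, seq_len, head_dim)
    """
    group = num_heads // num_kv_heads  # query heads per KV head
    return np.repeat(kv, group, axis=0)

# 4 KV heads expanded to serve 8 query heads (2 queries per KV head).
kv = np.arange(4 * 3 * 2, dtype=np.float32).reshape(4, 3, 2)
expanded = repeat_kv(kv, num_heads=8, num_kv_heads=4)
assert expanded.shape == (8, 3, 2)
assert np.array_equal(expanded[0], expanded[1])  # heads 0 and 1 share KV head 0
```

Halving the KV head count shrinks the KV projection weights and cache while leaving the query side at full width.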
Quantization
- int8 (bits: 8), applied to the final model with hybrid fp16/int8 token embeddings.
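A sketch of symmetric per-tensor int8 quantization with an optional clipping fraction, matching the clip-search idea listed under Other below. The exact quantization scheme (symmetric, per-tensor) is an assumption; the run's export may differ:

```python
import numpy as np

def quantize_int8(w, clip=1.0):
    """Symmetric int8 quantization; clip < 1.0 shrinks the max-abs range
    before computing the scale (clip=1.0 means no clipping)."""
    max_abs = float(np.abs(w).max()) * clip
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(16, 16)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
assert err <= scale  # round-trip error bounded by one quantization step
```

Storing int8 weights plus one fp32 scale per tensor is what brings the artifact under the 16 MB cap.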
Evaluation
- sliding window eval (context_length: 2048, stride_tokens: 512).
Sequence Length
- train_length: 2048, eval_length: 2048.
Optimizer
- Muon (momentum: 0.99, weight_decay: null; matrix_lr: 0.02, scalar_lr: 0.02, tied_embed_lr: 0.03).
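Muon applies an approximate orthogonalization to each weight matrix's momentum buffer via a Newton-Schulz iteration before taking the update step. A minimal NumPy sketch of that iteration, using the quintic coefficients from the public Muon reference implementation (whether this run used those exact coefficients is an assumption):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Push the singular values of G toward 1 with a quintic Newton-Schulz
    iteration, as Muon does to each matrix's momentum before the update."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize by Frobenius norm
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

# Singular values 3, 2, 1, 0.5 are all driven toward 1.
O = newton_schulz_orth(np.diag([3.0, 2.0, 1.0, 0.5]))
s = np.linalg.svd(O, compute_uv=False)
assert s.max() < 1.5 and s.min() > 0.4
```

The separate matrix/scalar/tied-embedding learning rates reflect that Muon only handles 2-D weight matrices; scalars and embeddings typically go through a different optimizer such as Adam.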
LR Schedule
- warmup + warmdown (warmup_steps: 20, warmdown_iters: 3000).
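A sketch of a trapezoidal schedule matching the listed parameters: 20 steps of linear warmup, a constant plateau, then a 3000-iteration linear warmdown to zero. The exact shape (linear ramps, decay to exactly zero) is an assumption:

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR multiplier: linear warmup, flat middle, linear warmdown."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0

assert lr_scale(0, 5000) == 1 / 20        # first warmup step
assert lr_scale(100, 5000) == 1.0         # flat phase
assert lr_scale(3500, 5000) == 0.5        # halfway through warmdown
assert lr_scale(5000, 5000) == 0.0        # fully decayed
```

The very short warmup (20 steps) fits a speedrun setting where nearly the entire budget should be spent at full learning rate.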
Regularization
- gradient clipping (grad_clip_norm: 0.3).
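Clipping at a global norm of 0.3 rescales all gradients together whenever their combined L2 norm exceeds the threshold. A minimal NumPy sketch (whether clipping here is global or per-tensor is an assumption; global-norm is the common default):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=0.3):
    """Scale all gradients down together if their global L2 norm exceeds max_norm."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

grads = [np.ones((2, 2)), np.ones(3)]  # global norm = sqrt(4 + 3) ~ 2.65
clipped, norm = clip_by_global_norm(grads, max_norm=0.3)
new_norm = np.sqrt(sum(np.sum(g * g) for g in clipped))
assert abs(new_norm - 0.3) < 1e-6
```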
Compression
- zlib (level: null).
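The artifact size is presumably measured after zlib compression of the serialized weights. A sketch using Python's standard zlib module (the level is listed as null, so level 9 below is an assumption):

```python
import zlib

def compressed_size(payload, level=9):
    """Return the zlib-compressed size in bytes of a serialized artifact."""
    return len(zlib.compress(payload, level))

payload = b"\x00" * 10_000  # highly compressible dummy "weights"
assert compressed_size(payload) < len(payload)
```

Quantizing to int8 first helps here too: low-entropy integer weight bytes compress better than raw fp16.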
Other
- Clip-search post-training quantization over candidate clipping thresholds (candidates: [1, 0.95, 0.9, 0.85]).
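Clip-search PTQ tries each candidate clipping threshold per tensor and keeps the one with the best round-trip quality. A sketch assuming symmetric int8 quantization and reconstruction MSE as the selection metric (the actual metric used by the run is an assumption):

```python
import numpy as np

def clip_search(w, candidates=(1.0, 0.95, 0.9, 0.85)):
    """Pick the clipping fraction that minimizes int8 round-trip MSE.

    Each candidate scales the max-abs range before quantizing; clipping a few
    outliers can shrink the step size enough to reduce overall error.
    """
    best = None
    for clip in candidates:
        scale = (float(np.abs(w).max()) * clip) / 127.0
        q = np.clip(np.round(w / scale), -127, 127)
        mse = float(np.mean((q * scale - w) ** 2))
        if best is None or mse < best[1]:
            best = (clip, mse)
    return best[0]

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
assert clip_search(w) in (1.0, 0.95, 0.9, 0.85)
```

Because the search runs after training on fixed weights, it costs only a few quantization passes per tensor and needs no retraining.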
Novel Contributions
- Clean under-cap 8xH100 snapshot for the 10-minute / 16,000,000-byte track
- Clip-search PTQ
- Hybrid fp16/int8 export for token embeddings with top rows kept in fp16
- Sliding-window validation at 2048 context with 512-token stride
- Tied-embedding dense transformer baseline with grouped KV heads