val_bpb: 1.1884
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,937,608 bytes
Training Techniques

Quantization: mixed int8/fp16
- bits: 8
- scope: all weights except tok_emb.weight, which is kept in fp16; blocks.5. selectively coarsened
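A minimal sketch of what this mixed int8/fp16 packing could look like, assuming symmetric per-tensor int8 quantization with a single scale per tensor (the exact scheme is not specified here); the state-dict names are hypothetical stand-ins:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8: store int8 codes plus a single float scale."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

# Hypothetical tiny state dict mirroring the scope rule above.
rng = np.random.default_rng(0)
state = {
    "tok_emb.weight": rng.normal(size=(16, 8)).astype(np.float32),
    "blocks.0.attn.weight": rng.normal(size=(8, 8)).astype(np.float32),
}
packed = {}
for name, w in state.items():
    if name == "tok_emb.weight":
        packed[name] = ("fp16", w.astype(np.float16))   # exception: kept in fp16
    else:
        packed[name] = ("int8",) + quantize_int8(w)     # everything else -> int8
```

With 8-bit codes the worst-case rounding error per weight is half a quantization step (scale / 2), which is why keeping the embedding table in fp16 can matter for the final bpb.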
Architecture: tied embeddings
- Input and output embeddings are tied.
- parameters: {"tie_embeddings": 1}
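Weight tying means the output head reuses the input embedding matrix instead of learning a separate vocab-sized projection. A minimal sketch (shapes are illustrative, not the model's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 32, 8
tok_emb = rng.normal(size=(vocab, d_model)).astype(np.float32)

def embed(token_ids):
    return tok_emb[token_ids]        # input embedding: row lookup

def unembed(hidden):
    return hidden @ tok_emb.T        # output head reuses the same matrix

h = embed(np.array([3, 7]))
scores = unembed(h)                  # shape (2, vocab), no separate lm_head weights
```

Besides the quality effect, tying halves the embedding-related parameter count, which directly shrinks the artifact size reported above.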
KV head count
- Uses grouped-query style attention with fewer KV heads than query heads.
- parameters: {"num_heads": 8, "num_kv_heads": 4}
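With num_heads=8 and num_kv_heads=4, each KV head is shared by a group of two query heads. A single-token-batch sketch of the sharing (dimensions here are illustrative):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 16, 4
group = num_heads // num_kv_heads    # 2 query heads per KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(num_heads, seq, head_dim))
k = rng.normal(size=(num_kv_heads, seq, head_dim))
v = rng.normal(size=(num_kv_heads, seq, head_dim))

# Broadcast each KV head to its group of query heads, then attend as usual.
k_exp = np.repeat(k, group, axis=0)
v_exp = np.repeat(v, group, axis=0)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_exp                # shape (num_heads, seq, head_dim)
```

Halving the KV heads halves the KV cache and the K/V projection parameters, a useful trade at a 4096-token training length.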
Optimizer: Muon
- weight_decay: null
- momentum: 0.99
- other_params: {"warmup_start": 0.92, "warmup_steps": 1500}
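One plausible reading of these parameters, which would need confirming against the run's train_gpt.py, is a momentum warmup: ramp from warmup_start (0.92) to the final momentum (0.99) over warmup_steps. A sketch under that assumption:

```python
def momentum_at(step, start=0.92, end=0.99, warmup_steps=1500):
    """Hypothetical linear momentum warmup; the exact schedule shape is an
    assumption, not taken from the submission."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```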
LR Schedule: warmdown
- parameters: {"warmdown_steps": 3000}
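A warmdown schedule typically holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps. A sketch under that common interpretation (base_lr and total_steps are placeholders, not values from the run):

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_steps=3000):
    """Hold base_lr, then decay linearly to 0 over the last warmdown_steps."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```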
Sequence Length
- train_length: 4096
- eval_length: 4096
Compression: zlib
- level: null
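With level null, zlib presumably runs at its default compression level. Low-entropy int8 weight codes compress well, which is how the reported artifact size is reached; a sketch on stand-in bytes:

```python
import zlib
import numpy as np

# Stand-in for the packed int8 weight bytes (not the real checkpoint).
rng = np.random.default_rng(0)
payload = rng.integers(-8, 8, size=65536, dtype=np.int8).tobytes()

compressed = zlib.compress(payload)   # no level argument -> zlib default
restored = zlib.decompress(compressed)

assert restored == payload            # lossless round trip
assert len(compressed) < len(payload)
```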
Novel Contributions
- Adds a new 10-minute 8xH100 record for the long-context lane
- Uses seq_len=4096 training with TRAIN_BATCH_TOKENS=393216
- Keeps tok_emb.weight in fp16 while coarsening only blocks.5. to recover bytes
- Tunes the Muon schedule for the submission
- Includes canonical run plus reproducibility reruns and exact train_gpt.py snapshot
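Assuming TRAIN_BATCH_TOKENS counts tokens per optimizer step, the figures in the list above pin down the per-step sequence count; a quick arithmetic check:

```python
TRAIN_BATCH_TOKENS = 393216
SEQ_LEN = 4096

# 393216 tokens per step / 4096 tokens per sequence = 96 sequences per step.
assert TRAIN_BATCH_TOKENS % SEQ_LEN == 0
sequences_per_step = TRAIN_BATCH_TOKENS // SEQ_LEN
```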