val_bpb: 1.1768
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,943,260 bytes
Training Techniques
Quantization: int8
- bits: 8
- scope: all weights except tok_emb.weight; selective coarsening on blocks.5.
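The entry does not spell out the quantization scheme, so below is a minimal sketch assuming symmetric per-tensor int8 quantization (w ≈ scale · q), with the tok_emb.weight exclusion applied by parameter name; all shapes and names besides tok_emb.weight are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8: w ~= scale * q, with q in [-127, 127]."""
    m = float(np.abs(w).max())
    scale = m / 127.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Apply to every tensor except the token embedding, per the scope above.
weights = {
    "tok_emb.weight": np.ones((4, 2), dtype=np.float32),
    "blocks.0.attn.w": np.linspace(-1, 1, 8, dtype=np.float32).reshape(4, 2),
}
packed = {name: (w if name == "tok_emb.weight" else quantize_int8(w))
          for name, w in weights.items()}
```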
Architecture: tied embeddings
- Input and output embeddings are tied.
- parameters: {"tie_embeddings":1}
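Tying means a single matrix serves as both the input embedding lookup and, transposed, the output projection, halving the embedding parameter count. A minimal NumPy sketch of the idea (vocab and model sizes are illustrative):

```python
import numpy as np

vocab, d_model = 10, 4
emb = np.random.default_rng(0).normal(size=(vocab, d_model)).astype(np.float32)

def embed(token_ids):
    # Input side: look up rows of the shared table.
    return emb[token_ids]

def output_logits(hidden):
    # Output side: reuse the same table, transposed, as the unembedding.
    return hidden @ emb.T

h = embed(np.array([3]))        # (1, d_model)
scores = output_logits(h)       # (1, vocab)
```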
Architecture: KV head count
- Uses grouped-query attention (GQA) with fewer KV heads than attention heads.
- parameters: {"num_heads":8,"num_kv_heads":4}
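With num_heads=8 and num_kv_heads=4, each K/V head is shared by num_heads / num_kv_heads = 2 query heads, halving the KV cache. A NumPy sketch of the head expansion (the model's actual attention code is not part of this entry; head_dim and sequence length are illustrative):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 16, 5
group = num_heads // num_kv_heads            # query heads per KV head -> 2
rng = np.random.default_rng(0)

q = rng.normal(size=(num_heads, seq, head_dim))
k = rng.normal(size=(num_kv_heads, seq, head_dim))
v = rng.normal(size=(num_kv_heads, seq, head_dim))

# Repeat K/V so each group of query heads attends to the same KV head.
k_exp = np.repeat(k, group, axis=0)          # (num_heads, seq, head_dim)
v_exp = np.repeat(v, group, axis=0)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)     # softmax over key positions
out = attn @ v_exp                           # (num_heads, seq, head_dim)
```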
Optimizer: Muon
- weight_decay: null
- momentum: 0.99
- other_params: {"warmup_start":0.92,"warmup_steps":1500}
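Muon is usually described as SGD-momentum whose 2D weight update is orthogonalized with a Newton-Schulz iteration before being applied. The sketch below follows that public description; the iteration coefficients, step count, and learning rate are assumptions, not values recorded in this entry:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315       # assumed quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)      # Frobenius-normalize first
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99):
    """One Muon update: accumulate momentum, then orthogonalize the update."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf
```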
LR Schedule: warmdown
- parameters: {"warmdown_iters":3000}
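A warmdown schedule holds the learning rate flat and then decays it linearly over the final warmdown_iters steps. A sketch assuming a linear ramp to zero (the zero endpoint is an assumption):

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int = 3000) -> float:
    """Multiplier on the base LR: 1.0 until the warmdown starts,
    then a linear ramp down to 0.0 at the final step."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)
```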
Evaluation: sliding window eval
- parameters: {"stride":64,"window_length":4096,"batch_size":32}
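With window_length=4096 and stride=64, each forward pass slides the context by 64 tokens and scores only the tokens not already covered by the previous pass, so every token is evaluated exactly once with near-full context. A sketch of the span planning (the helper name is hypothetical):

```python
def sliding_window_spans(n_tokens: int, window: int = 4096, stride: int = 64):
    """Return (ctx_start, ctx_end, score_start, score_end) per forward pass:
    the model sees tokens [ctx_start, ctx_end) but only [score_start,
    score_end) contributes to the loss, so no token is scored twice."""
    spans, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_to, end))
        scored_to = end
        if end == n_tokens:
            break
    return spans
```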
Sequence Length
- train_length: 4096
- eval_length: 4096
Compression: zlib
- level: null
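level: null presumably falls back to zlib's default compression level (-1, which maps to level 6). With Python's stdlib that is just the plain call, paired with an exact roundtrip check:

```python
import zlib

payload = bytes(range(256)) * 64      # stand-in for the packed weight bytes
blob = zlib.compress(payload)         # no level argument -> default (-1)
restored = zlib.decompress(blob)
assert restored == payload            # exact roundtrip
```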
Novel Contributions
- Sets a new 10-minute 8xH100 sliding-window record.
- Uses stride-64 sliding-window evaluation after standard exact roundtrip checking.
- Keeps tok_emb.weight in fp16 while coarsening only blocks.5. to fit the artifact budget.
- Trains at sequence length 4096 with a tuned Muon schedule.