val_bpb: 1.5140
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 2,033,640 bytes
Training Techniques
Quantization: int8 (bits: 8, scope: all)
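The int8 quantization above (8 bits, applied to all weights) can be sketched as symmetric per-tensor quantization: each float tensor is stored as int8 codes plus a single float scale. This is a minimal NumPy sketch of the general technique; the submission's actual quantization code and granularity (per-tensor vs per-channel) are not specified in the card.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8: store int8 codes plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
# round-trip error is bounded by half a quantization step
assert np.abs(w - dequantize(q, s)).max() <= 0.5 * s + 1e-6
```

Storing int8 codes instead of float32 weights is what shrinks the artifact roughly 4x, at the cost of the bounded rounding error checked above.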
Architecture: weight tying (parameters: null). Tied transformer block weights across layers, with per-layer norms and gates left unchanged (Family 1A).
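The Family 1A tying scheme can be illustrated with a toy forward pass: one shared block weight is reused at every layer, while each layer keeps its own (untied) norm gain and residual gate. This is a minimal NumPy sketch with an illustrative single-matrix "block"; the names and the rmsnorm/gate details are assumptions, not taken from train_gpt.py.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4

# ONE block weight shared by every layer (the tied parameters)...
W_shared = rng.standard_normal((d, d)) / np.sqrt(d)
# ...while norm gains and gates stay per-layer (untied), as in Family 1A.
gains = np.ones((n_layers, d))
gates = np.full((n_layers, 1), 0.5)

def rmsnorm(x, gain):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6) * gain

def forward(x):
    for layer in range(n_layers):
        h = rmsnorm(x, gains[layer]) @ W_shared  # same weights at every depth
        x = x + gates[layer] * h                 # per-layer gate, residual add
    return x

y = forward(rng.standard_normal((2, d)))
```

Tying divides the block parameter count by the layer count, which is presumably what keeps the artifact near 2 MB here; the cheap per-layer gains and gates let depths still behave differently.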
Optimizer: Muon + AdamW (weight_decay: null, momentum: null, other_params: null)
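In train_gpt.py-style setups, Muon typically updates the 2-D matrix parameters while AdamW handles embeddings, norms, and the head; the card leaves all hyperparameters null, so none are shown here. The distinctive step in Muon is orthogonalizing each gradient matrix via a Newton-Schulz iteration. The sketch below uses the classic cubic iteration; Muon's reference implementation uses a tuned quintic variant, so this is illustrative rather than the exact update.

```python
import numpy as np

def newton_schulz_orth(g: np.ndarray, steps: int = 15) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix (the core of a Muon step).

    Classic cubic Newton-Schulz: drives all singular values toward 1.
    """
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm <= 1 keeps singular values in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

g = np.array([[2.0, 1.0],
              [0.0, 1.0]])
o = newton_schulz_orth(g)
```

After the iteration, `o @ o.T` is close to the identity, i.e. the update direction has roughly equalized singular values, which is the property Muon exploits.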
Regularization: gradient clipping (parameters: {"clip_value": 1, "type": "global"})
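Global clipping with clip_value 1 means every gradient tensor is rescaled by the same factor so the L2 norm over all parameters combined is at most 1. A minimal NumPy sketch (in PyTorch this is `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_global_norm(grads, clip_value=1.0):
    """Scale ALL gradients by one shared factor so the global L2 norm <= clip_value."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = clip_value / max(total, clip_value)  # <= 1; a no-op when already in bound
    return [g * scale for g in grads], total

grads = [np.full((3,), 2.0), np.full((2,), -1.0)]
clipped, norm_before = clip_global_norm(grads, clip_value=1.0)
norm_after = np.sqrt(sum(np.sum(g * g) for g in clipped))
```

Because one scale is shared, the relative direction of the full gradient is preserved; only its magnitude is capped.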
LR Schedule: linear warmup (parameters: {"warmup_steps": 30})
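The 30-step linear warmup ramps the learning rate from zero to its base value, then holds it. A sketch of the schedule; the base learning rate of 0.02 is illustrative, not from the card:

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int = 30) -> float:
    """Linear warmup: ramp from ~0 to base_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# warmup_steps=30 as in this submission; base_lr=0.02 is an assumed example value
lrs = [warmup_lr(s, base_lr=0.02, warmup_steps=30) for s in range(40)]
```

Short warmups like this are common in speedrun recipes: they avoid unstable early updates without spending many of the tightly budgeted steps at a reduced rate.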
Novel Contributions
- Reproducible snapshot of Family 1 / Batch 1A with tied transformer block weights
- Stable training recipe, including global gradient clipping at 1.0 and a 30-step linear warmup
- Use of the Muon + AdamW optimizer combination as in train_gpt.py
- Submission targets a single-GPU run under a 600-second wall-clock cap, not the official 8×H100 10-minute record track