| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.2459 | — | Muon | 15.9 MB |
## Training Techniques
**Quantization:** int8 (bits: 8, scope: all)

**Compression:** zlib (level: null)
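The int8 quantization plus zlib compression of the artifact can be sketched as symmetric per-tensor quantization followed by byte-level compression. This is a minimal sketch, not the submission's exact scheme; the zlib level is unspecified above, so the default is used here.

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize with q * scale

def compress_weights(w: np.ndarray) -> bytes:
    """Quantize to int8, then zlib-compress the raw bytes.
    Quantized weights compress far better than float32 bytes."""
    q, _scale = quantize_int8(w)
    return zlib.compress(q.tobytes())
```

Storing the per-tensor `scale` alongside the compressed bytes is enough to dequantize at load time.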
**Architecture:**
- Tied embeddings: input and output embeddings share one weight matrix to reduce model size.
- KV head count: grouped-query attention with 4 attention heads and 2 KV heads (`{"heads": 4, "kv_heads": 2}`).
**Optimizer:** Muon with tuned momentum (weight_decay: null; the tuned momentum value is not reported)
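Muon's core idea is heavy-ball momentum on the raw gradient followed by an approximate orthogonalization of the update via a Newton-Schulz iteration. The sketch below assumes the quintic coefficients from the public Muon reference implementation; the submission's tuned momentum value is not reported, so 0.95 is a placeholder.

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize a 2-D update (flatten its singular-value
    spread toward 1) with a quintic Newton-Schulz iteration. Coefficients
    follow the public Muon reference implementation (an assumption here)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    tall = G.shape[0] > G.shape[1]
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update: accumulate momentum, orthogonalize, step."""
    buf = momentum * buf + grad
    W = W - lr * newton_schulz_orth(buf)
    return W, buf
```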
**LR Schedule:** warmup schedule (parameters: null)

**Regularization:** gradient clipping (parameters: null)
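Neither the warmup schedule nor the clipping threshold is parameterized above, so the sketch below assumes a common linear-warmup/linear-decay schedule and global-norm clipping; all constants are placeholders.

```python
import numpy as np

def lr_at(step, base_lr=0.02, warmup_steps=256, total_steps=5000):
    """Linear warmup to base_lr, then linear decay to zero. The exact
    schedule shape is not specified in the submission; this is one
    common choice."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - frac)

def clip_grads(grads, max_norm=1.0):
    """Global-norm gradient clipping: rescale all tensors together so the
    combined L2 norm does not exceed max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total
```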
**Other:** Autoresearch-style autonomous hyperparameter search using an AI coding agent over 97 experiments on an RTX 4080, then validation on 8x H100 SXM (`{"experiments": 97, "dev_hardware": "RTX 4080", "submission_hardware": "8x H100 SXM"}`).
## Novel Contributions
- Autonomous AI coding agent iteratively optimized hyperparameters (autoresearch pattern).
- Tied embeddings to reduce model size.
- Optimizer tuning including Muon momentum, warmup schedule, and gradient clipping.
- Attention configuration with 4 heads and 2 KV heads via GQA.
- Learning rate adjustments across all parameter groups.
- Validated on 8x H100 SXM after development on a single RTX 4080.