PR #343 (open)

Submission: val_bpb=1.2459 (autoresearch-optimized)

by joeynyc
val_bpb: 1.2459
Optimizer: Muon
Artifact size: 15.9 MB

Training Techniques

Quantization: int8 (bits: 8, scope: all)
Compression: zlib (level: not specified)
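The int8 quantization plus zlib compression of the artifact can be sketched as below. This is a minimal stand-in, assuming symmetric per-tensor scaling (the PR does not say whether scaling is per-tensor or per-channel) and zlib's default compression level (the PR leaves the level unspecified):

```python
import struct
import zlib

def quantize_int8(weights):
    """Symmetric int8 quantization with a single per-tensor scale.
    A sketch: the submission's exact scaling granularity is not disclosed."""
    scale = max((abs(w) for w in weights), default=1.0) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def pack_and_compress(q):
    # Pack quantized values as signed bytes, then zlib-compress them
    # (default level, since the PR lists "level: null").
    return zlib.compress(struct.pack(f"{len(q)}b", *q))

weights = [0.6, -1.0, 0.03, 0.77]
q, scale = quantize_int8(weights)
blob = pack_and_compress(q)
# Dequantization recovers each weight to within scale/2 (the rounding error).
deq = [v * scale for v in q]
```

Round-tripping through `zlib.decompress` and the same `struct` format restores the int8 tensor exactly; only the quantization rounding is lossy.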
Architecture:
  • Tied embeddings — input and output embeddings are tied to reduce model size.
  • KV head count — grouped-query attention (GQA) with 4 attention heads and 2 KV heads.
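With 4 query heads and 2 KV heads, each KV head serves a group of 2 query heads, halving the KV cache. A small sketch of that mapping and the resulting cache footprint (the contiguous-group convention and the cache-size formula are standard GQA assumptions, not details from the PR):

```python
def kv_head_for_query_head(q_head: int, n_heads: int = 4, n_kv_heads: int = 2) -> int:
    """Map a query head to its shared KV head. In GQA, query heads are split
    into contiguous groups of n_heads // n_kv_heads that share one KV head.
    Head counts are from the PR; the grouping convention is assumed."""
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads
    return q_head // group_size

def kv_cache_bytes(seq_len: int, head_dim: int, n_kv_heads: int,
                   bytes_per_elt: int = 1) -> int:
    """KV cache size: keys + values, one entry per KV head per position.
    bytes_per_elt=1 matches int8 storage; the cache shrinks linearly in
    n_kv_heads, so 2 KV heads instead of 4 halves it."""
    return 2 * seq_len * head_dim * n_kv_heads * bytes_per_elt
```

For example, query heads 0 and 1 read KV head 0, while heads 2 and 3 read KV head 1.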
Optimizer: Muon — momentum tuned (value not disclosed); weight decay not specified.
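Muon's core step orthogonalizes the momentum-averaged gradient of each 2-D weight matrix with a quintic Newton–Schulz iteration. A sketch of that step, using the coefficients from the public Muon reference implementation (the PR's tuned momentum value and any local modifications are not disclosed):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a 2-D update via Newton-Schulz iteration,
    as in Muon. Coefficients follow the public reference implementation;
    this is a sketch, not the submission's exact code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-normalize: spectral norm <= 1
    transposed = g.shape[0] > g.shape[1]
    if transposed:                      # keep the Gram matrix on the small side
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x
```

In the full optimizer, this runs on the momentum buffer every step before the (scaled) update is applied; the iteration pushes all singular values toward 1 without an explicit SVD.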
LR Schedule: warmup schedule (parameters not specified)
Regularization: gradient clipping (parameters not specified)
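Since the PR lists both techniques without parameters, here is a hedged sketch of the usual forms: linear LR warmup and global-norm gradient clipping. The base LR, warmup length, and clip threshold below are illustrative, not values from the submission:

```python
def warmup_lr(step: int, base_lr: float = 3e-3, warmup_steps: int = 256) -> float:
    """Linear warmup to base_lr, then constant. Shape assumed; base_lr and
    warmup_steps are placeholders, as the PR gives no parameters."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

def clip_grad_norm(grads, max_norm: float = 1.0):
    """Global-norm gradient clipping: rescale all gradients together so their
    joint L2 norm is at most max_norm (threshold illustrative)."""
    total = sum(g * g for g in grads) ** 0.5
    if total > max_norm:
        scale = max_norm / total
        return [g * scale for g in grads]
    return list(grads)
```

Warmup and clipping compose independently: the schedule sets the step size while clipping bounds the update direction's magnitude.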
Other: autoresearch-style autonomous hyperparameter search using an AI coding agent over 97 experiments on a single RTX 4080, with final validation on 8x H100 SXM.
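The agent-driven search loop is not included in the PR; structurally it amounts to "propose config, run experiment, keep the best." The sketch below reduces that pattern to seeded random search over two hypothetical hyperparameters with a toy objective standing in for a real training run (only the experiment count of 97 comes from the PR):

```python
import random

def run_experiment(cfg):
    """Hypothetical stand-in for one training run returning val_bpb.
    A toy quadratic objective keeps the sketch runnable; the real
    pipeline trains a model and evaluates bits-per-byte."""
    return (cfg["lr"] - 3e-3) ** 2 + (cfg["momentum"] - 0.95) ** 2 + 1.2

random.seed(0)
best_cfg, best_bpb = None, float("inf")
for trial in range(97):  # experiment count from the PR
    cfg = {
        "lr": random.uniform(1e-4, 1e-2),        # search ranges are illustrative
        "momentum": random.uniform(0.80, 0.99),
    }
    bpb = run_experiment(cfg)
    if bpb < best_bpb:
        best_cfg, best_bpb = cfg, bpb
```

An AI coding agent differs from this loop mainly in the "propose" step: instead of uniform sampling, it reads prior results and edits configs (or code) between runs.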

Novel Contributions

  • Autonomous AI coding agent iteratively optimized hyperparameters (autoresearch pattern).
  • Tied embeddings to reduce model size.
  • Optimizer tuning including Muon momentum, warmup schedule, and gradient clipping.
  • Attention configuration with 4 heads and 2 KV heads via GQA.
  • Learning rate adjustments across all parameter groups.
  • Validated on 8x H100 SXM after development on a single RTX 4080.