PR #536

open

Non-record: Family 1A tied blocks (1xH100 dev snapshot)

by jaksenc
val_bpb: 1.5140
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 2,033,640 bytes

Training Techniques

Quantization: int8
  bits: 8
  scope: all
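The PR summary lists int8 quantization over all weights but does not specify the artifact format. As a hedged illustration only, a minimal symmetric absmax int8 quantizer (one scale per tensor, which is one common choice) looks like this:

```python
def quantize_int8(xs):
    # Symmetric absmax quantization: pick a per-tensor scale so the
    # largest magnitude maps to 127, then round each value to int8 range.
    # Illustrative sketch only; the PR's actual packing is not specified.
    scale = max(abs(x) for x in xs) / 127.0 or 1.0  # avoid 0 scale for all-zero input
    q = [max(-128, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate floats from int8 codes and the stored scale.
    return [v * scale for v in q]
```

Storing one float scale plus one byte per weight is what makes an int8 artifact roughly 4x smaller than fp32.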
Architecture: weight tying
  Tied transformer block weights across layers with per-layer norms and gates unchanged (Family 1A)
  parameters: null
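The tying scheme above can be sketched with a single shared block object reused at every depth, while each layer keeps its own norm/gate parameters. This is a hedged stand-in (plain Python objects in place of tensors; names like `norm_scale` and `gate` are illustrative, not taken from the PR):

```python
class Block:
    """Stand-in for one transformer block's attention/MLP weights."""
    def __init__(self):
        self.weights = {"attn": 1.0, "mlp": 2.0}  # placeholder "tensors"

def build_tied_model(n_layers):
    # Family 1A idea as described above: one set of block weights is
    # reused at every layer; only per-layer norms and gates stay untied.
    shared = Block()
    return [{"block": shared, "norm_scale": 1.0, "gate": 1.0}
            for _ in range(n_layers)]
```

Because every layer holds a reference to the same `Block`, the parameter count (and the serialized artifact) stays close to that of a one-block model regardless of depth.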
Optimizer: Muon + AdamW
  weight_decay: null
  momentum: null
  other_params: null
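The PR does not list which parameters go to which optimizer. As a hedged sketch only: train_gpt.py-style Muon + AdamW setups commonly route 2-D matrix weights to Muon and everything else (embeddings, norms, gates, biases) to AdamW. The routing logic, with illustrative parameter names, might look like:

```python
def split_params(named_params):
    # Hypothetical Muon/AdamW split: 2-D hidden matrices go to Muon,
    # everything else (embeddings, 1-D norm/gate params) to AdamW.
    # The PR's exact split is not specified; this shows the pattern only.
    muon, adamw = [], []
    for name, param in named_params:
        if param["ndim"] == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```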
Regularization: gradient clipping
  parameters: {"clip_value": 1, "type": "global"}
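Global gradient clipping at 1.0, as listed above, rescales the whole gradient vector when its L2 norm exceeds the threshold. A minimal sketch over a flat list of gradient values (real training code operates on tensors):

```python
import math

def clip_global_norm(grads, clip_value=1.0):
    # Compute the global L2 norm across all gradient entries; if it
    # exceeds clip_value, scale every entry down uniformly so the
    # clipped gradient has norm exactly clip_value.
    total = math.sqrt(sum(g * g for g in grads))
    if total > clip_value:
        scale = clip_value / total
        grads = [g * scale for g in grads]
    return grads, total
```

Uniform rescaling preserves the gradient's direction, unlike per-element clipping, which is what "type": "global" refers to.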
LR Schedule: linear warmup
  parameters: {"warmup_steps": 30}
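The 30-step linear warmup above can be sketched as a step-indexed learning-rate function. The post-warmup schedule is not specified in this summary, so the sketch simply holds the base rate afterwards:

```python
def warmup_lr(step, base_lr, warmup_steps=30):
    # Linearly ramp the learning rate from base_lr/warmup_steps up to
    # base_lr over the first warmup_steps steps, then hold it constant
    # (the PR summary does not describe any decay after warmup).
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```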

Novel Contributions

  • Reproducible snapshot of Family 1 / Batch 1A with tied transformer block weights
  • Stable training recipe with global gradient clipping at 1.0 and a 30-step linear data warmup
  • Use of the Muon + AdamW optimizer combination, as in train_gpt.py
  • Submission targets a 1×GPU run under a 600-second wallclock cap, not the official 8×H100 10-minute record track