val_bpb: 1.5140
Architecture: Transformer
Optimizer: Muon + AdamW
Artifact Size: 2,033,640 bytes
Training Techniques
Quantization: int8 (bits: 8, scope: all)
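The int8 quantization above (8 bits, applied to all weights) can be sketched as symmetric per-tensor quantization: each float tensor is stored as int8 codes plus a single float scale. This is a minimal NumPy sketch of the general technique; the submission's actual quantization code and granularity (per-tensor vs per-channel) are not specified in the card.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8: store int8 codes plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
# round-trip error is bounded by half a quantization step
assert np.abs(w - dequantize(q, s)).max() <= 0.5 * s + 1e-6
```

Storing int8 codes instead of float32 weights is what shrinks the artifact roughly 4x, at the cost of the bounded rounding error checked above.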
Architecture: weight tying (parameters: null). Tied transformer block weights across layers, with per-layer norms and gates left unchanged (Family 1A).
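The Family 1A tying scheme can be illustrated with a toy forward pass: one shared block weight is reused at every layer, while each layer keeps its own (untied) norm gain and residual gate. This is a minimal NumPy sketch with an illustrative single-matrix "block"; the names and the rmsnorm/gate details are assumptions, not taken from train_gpt.py.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4

# ONE block weight shared by every layer (the tied parameters)...
W_shared = rng.standard_normal((d, d)) / np.sqrt(d)
# ...while norm gains and gates stay per-layer (untied), as in Family 1A.
gains = np.ones((n_layers, d))
gates = np.full((n_layers, 1), 0.5)

def rmsnorm(x, gain):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6) * gain

def forward(x):
    for layer in range(n_layers):
        h = rmsnorm(x, gains[layer]) @ W_shared  # same weights at every depth
        x = x + gates[layer] * h                 # per-layer gate, residual add
    return x

y = forward(rng.standard_normal((2, d)))
```

Tying divides the block parameter count by the layer count, which is presumably what keeps the artifact near 2 MB here; the cheap per-layer gains and gates let depths still behave differently.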
Optimizer: Muon + AdamW (weight_decay: null, momentum: null, other_params: null)
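In train_gpt.py-style setups, Muon typically updates the 2-D matrix parameters while AdamW handles embeddings, norms, and the head; the card leaves all hyperparameters null, so none are shown here. The distinctive step in Muon is orthogonalizing each gradient matrix via a Newton-Schulz iteration. The sketch below uses the classic cubic iteration; Muon's reference implementation uses a tuned quintic variant, so this is illustrative rather than the exact update.

```python
import numpy as np

def newton_schulz_orth(g: np.ndarray, steps: int = 15) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix (the core of a Muon step).

    Classic cubic Newton-Schulz: drives all singular values toward 1.
    """
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm <= 1 keeps singular values in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

g = np.array([[2.0, 1.0],
              [0.0, 1.0]])
o = newton_schulz_orth(g)
```

After the iteration, `o @ o.T` is close to the identity, i.e. the update direction has roughly equalized singular values, which is the property Muon exploits.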
Regularization: gradient clipping (parameters: {"clip_value": 1, "type": "global"})
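Global clipping with clip_value 1 means every gradient tensor is rescaled by the same factor so the L2 norm over all parameters combined is at most 1. A minimal NumPy sketch (in PyTorch this is `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_global_norm(grads, clip_value=1.0):
    """Scale ALL gradients by one shared factor so the global L2 norm <= clip_value."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = clip_value / max(total, clip_value)  # <= 1; a no-op when already in bound
    return [g * scale for g in grads], total

grads = [np.full((3,), 2.0), np.full((2,), -1.0)]
clipped, norm_before = clip_global_norm(grads, clip_value=1.0)
norm_after = np.sqrt(sum(np.sum(g * g) for g in clipped))
```

Because one scale is shared, the relative direction of the full gradient is preserved; only its magnitude is capped.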
LR Schedule: linear warmup (parameters: {"warmup_steps": 30})
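The 30-step linear warmup ramps the learning rate from zero to its base value, then holds it. A sketch of the schedule; the base learning rate of 0.02 is illustrative, not from the card:

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int = 30) -> float:
    """Linear warmup: ramp from ~0 to base_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# warmup_steps=30 as in this submission; base_lr=0.02 is an assumed example value
lrs = [warmup_lr(s, base_lr=0.02, warmup_steps=30) for s in range(40)]
```

Short warmups like this are common in speedrun recipes: they avoid unstable early updates without spending many of the tightly budgeted steps at a reduced rate.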
Novel Contributions
- Reproducible snapshot of Family 1 / Batch 1A with tied transformer block weights
- Stable training recipe, including global gradient clipping at 1.0 and a 30-step linear warmup
- Use of the Muon + AdamW optimizer combination as in train_gpt.py
- Submission targets a single-GPU run under a 600-second wall-clock cap, not the official 8×H100 10-minute record track