PR #483
Closed
Track 10min_16mb: PR #287 family rerun at 585s wallclock (mean val_bpb=1.1346)
by tmustier
val_bpb
1.1346
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,000,000 bytes
Training Techniques
Architecture
XSA
Applies XSA to the last 4 layers, as configured for the rerun family.
parameters: {"last_n":4}
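The XSA variant itself is not specified in this summary; a minimal, hypothetical sketch of the `{"last_n": 4}` selection logic (which layers get the variant) could look like:

```python
# Hypothetical sketch: apply a per-layer feature (here, the PR's "XSA"
# attention variant) only to the last `last_n` layers, per {"last_n": 4}.
# The layer count of 12 below is illustrative, not from the PR.
def xsa_layer_mask(n_layers: int, last_n: int) -> list:
    """Return one bool per layer; True means the layer uses XSA."""
    return [i >= n_layers - last_n for i in range(n_layers)]

mask = xsa_layer_mask(12, 4)  # only layers 8-11 enabled
```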
BigramHash
Adds a bigram hashing component to the model.
parameters: {"vocab_size":2048,"dim":128}
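A hedged sketch of what a bigram-hash component with `{"vocab_size": 2048, "dim": 128}` might do: hash each (previous token, current token) pair into a small embedding table and add the looked-up vector to the token features. The hash function and table initialization here are assumptions, not taken from the PR.

```python
import numpy as np

VOCAB_HASH, DIM = 2048, 128  # from the PR's parameters
table = np.random.default_rng(0).normal(0, 0.02, (VOCAB_HASH, DIM))

def bigram_hash(prev_ids, cur_ids):
    # Simple multiplicative mix; the actual hash is unspecified in the PR.
    h = (prev_ids * 1000003 + cur_ids) % VOCAB_HASH
    return table[h]  # shape: (seq_len, DIM)

ids = np.array([5, 17, 42, 42])
prev = np.concatenate([[0], ids[:-1]])  # shifted ids; 0 as a BOS placeholder
feat = bigram_hash(prev, ids)           # bigram features to add to embeddings
```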
MLP3x
Widens the MLP hidden layer to 3× the model width.
parameters: {"mlp_mult":3}
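As a quick sketch of what `mlp_mult=3` implies for parameter count (assuming a standard two-projection, bias-free MLP; the exact block layout is not given here):

```python
# Parameter count of a d_model -> mlp_mult*d_model -> d_model MLP block.
def mlp_param_count(d_model: int, mlp_mult: int) -> int:
    hidden = mlp_mult * d_model
    return d_model * hidden + hidden * d_model  # up- and down-projection, no bias

# Illustrative width only; the PR's d_model is not stated in this summary.
params_3x = mlp_param_count(768, 3)
```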
KV head count
Uses 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
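With 8 query heads and 4 KV heads, each KV head is shared by `heads // kv_heads = 2` query heads (grouped-query attention). A minimal NumPy sketch of the sharing, with illustrative shapes:

```python
import numpy as np

heads, kv_heads, d_head, T = 8, 4, 16, 5  # d_head and T are illustrative
rng = np.random.default_rng(0)
q = rng.normal(size=(heads, T, d_head))
k = rng.normal(size=(kv_heads, T, d_head))

# Repeat each KV head for its group of query heads before attention.
k_full = np.repeat(k, heads // kv_heads, axis=0)        # (8, T, d_head)
scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)  # (8, T, T)
```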
weight tying
Ties the input embedding and output projection weights.
parameters: null
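Weight tying means a single matrix serves as both the token embedding and the output head. A minimal sketch (vocabulary and width below are illustrative):

```python
import numpy as np

V, D = 100, 32  # illustrative vocab size and model width
W = np.random.default_rng(0).normal(0, 0.02, (V, D))  # one shared matrix

def embed(ids):
    return W[ids]        # input embedding: row lookup

def logits(h):
    return h @ W.T       # output head: project onto the same rows
```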
Quantization
QAT
bits: 6
scope: all
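A hedged sketch of the 6-bit quantization-aware training (QAT) round trip: weights are fake-quantized to int6 in the forward pass (with straight-through gradients during training, omitted here). The symmetric per-tensor scheme below is an assumption; the PR's exact scheme is not stated.

```python
import numpy as np

def fake_quant(w, bits=6):
    """Symmetric fake quantization: quantize to int{bits}, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)

w = np.random.default_rng(0).normal(0, 0.1, 64)
w_dq, q = fake_quant(w)  # dequantized weights and their int6 codes
```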
Weight Averaging
EMA
parameters: {"decay":0.997}
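EMA with `decay=0.997` keeps a shadow copy of the weights that moves a fraction `1 - decay` toward the live weights after each step; evaluation then uses the shadow copy. A minimal sketch:

```python
# EMA weight averaging: shadow <- decay * shadow + (1 - decay) * live.
def ema_update(shadow, live, decay=0.997):
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, live)]

shadow = [0.0]
for _ in range(1000):          # with live weight fixed at 1.0,
    shadow = ema_update(shadow, [1.0])  # shadow approaches 1 - 0.997**1000
```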
Evaluation
stride-based eval
parameters: {"stride":64}
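Stride-based evaluation typically slides a full-length context window forward in steps of `stride` and scores only the tokens not covered by the previous window, so most tokens are scored with near-full context. A sketch of the window bookkeeping under that assumption (the PR's exact scheme is not spelled out):

```python
def eval_spans(n_tokens, ctx_len=2048, stride=64):
    """Yield (window_start, window_end, n_scored) per evaluation window."""
    spans, prev_end = [], 0
    for end in range(min(ctx_len, n_tokens), n_tokens + 1, stride):
        start = max(0, end - ctx_len)          # window never exceeds ctx_len
        spans.append((start, end, end - prev_end))
        prev_end = end
    return spans

spans = eval_spans(2176)  # first window scores everything, later ones 64 each
```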
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
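A hedged sketch of this schedule shape: linear warmup over 20 steps, flat at the base LR, then a linear "warmdown" to zero over the final 3000 iterations. Total step count and base LR below are illustrative, not from the PR.

```python
def lr_at(step, total_steps, base_lr=1.0, warmup=20, warmdown=3000):
    """Trapezoidal schedule: linear warmup, constant, linear warmdown."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step >= total_steps - warmdown:
        return base_lr * (total_steps - step) / warmdown
    return base_lr
```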
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Other
other
Uses FlashAttention 3 for training.
parameters: null
other
Exports weights as int6 + zstd to fit the 16,000,000-byte artifact size limit.
parameters: null
Novel Contributions
- 3-seed rerun of the PR #287 family under a 585s wallclock cap
- Use of FlashAttention 3 on 8×H100 SXM
- Combination of XSA, EMA, BigramHash, and QAT
- int6 + zstd export to keep all seeds under the 16MB artifact limit
- Achieved a mean val_bpb of 1.1346, beating the merged SOTA of 1.1428