PR #2000
Non-record: Ablation Study — Untied Embeddings (val_bpb 1.3302)
Status: open
by edidiongumoh
val_bpb: 1.3302
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,017,328 bytes
Training Techniques
Architecture
weight tying
Untied input embeddings and output head by disabling embedding tying.
parameters: null
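
To make the ablation concrete, here is a minimal PyTorch sketch of tied vs. untied embeddings; the `TinyLM` module and `tie_embeddings` flag are illustrative stand-ins, not code from the PR's train_gpt.py.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    # Hypothetical stand-in for the baseline LM; `tie_embeddings` mirrors
    # the flag this ablation disables.
    def __init__(self, vocab_size: int, d_model: int, tie_embeddings: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if tie_embeddings:
            # Tied (baseline): input embedding and output head share one matrix.
            self.lm_head.weight = self.embed.weight
        # Untied (this PR's variant): the two matrices train independently,
        # at the cost of vocab_size * d_model extra parameters.

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.embed(idx))

model = TinyLM(vocab_size=50304, d_model=768, tie_embeddings=False)
```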
GQA
Used grouped query attention with 4 KV heads in the baseline; ablated to MHA with 8 KV heads.
parameters: {"num_kv_heads":4,"num_heads":8}
KV head count
Ablated KV head sharing from GQA to MHA.
parameters: {"num_kv_heads":8}
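
A sketch of how the GQA-to-MHA ablation maps onto PyTorch's scaled_dot_product_attention; the tensor shapes and the `repeat_interleave` KV expansion are illustrative assumptions, not the PR's implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (B, num_heads, T, D); k, v: (B, num_kv_heads, T, D).
    # With num_kv_heads == num_heads this is plain MHA (the ablated config).
    groups = num_heads // num_kv_heads
    if groups > 1:
        # GQA: each KV head serves `groups` query heads.
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, T, D = 2, 16, 64
q = torch.randn(B, 8, T, D)
k, v = torch.randn(B, 4, T, D), torch.randn(B, 4, T, D)
out_gqa = attention(q, k, v)                    # baseline: 4 KV heads
k8, v8 = torch.randn(B, 8, T, D), torch.randn(B, 8, T, D)
out_mha = attention(q, k8, v8, num_kv_heads=8)  # ablation: 8 KV heads
```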
attention
Flash-attention SDPA backend used in the baseline; compared against the math kernel.
parameters: {"backend":"flash"}
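
This comparison can be reproduced with PyTorch's SDPA kernel selector (`torch.nn.attention.sdpa_kernel`, available since roughly PyTorch 2.3); the shapes below are arbitrary and a CUDA device is assumed for the flash kernel.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Baseline: force the fused flash-attention kernel.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Ablation: force the unfused reference ("math") kernel.
with sdpa_kernel(SDPBackend.MATH):
    out_math = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Both compute the same function; they differ in speed and memory,
# which changes how far training gets within a fixed budget.
torch.testing.assert_close(out_flash, out_math, atol=1e-2, rtol=1e-2)
```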
Quantization
int8
bits: 8
scope: all
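
A generic sketch of symmetric per-tensor int8 quantization; the PR's exact scheme (per-tensor vs. per-channel, where scales are stored) is not specified here, so treat the details as assumptions.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = w.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, s = quantize_int8(w)
max_err = (dequantize_int8(q, s) - w).abs().max()
# int8 stores 1 byte/param vs. 4 for fp32: roughly 4x smaller before
# compression, which is what keeps the artifact under the 16 MB cap.
```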
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"scalars_and_embeddings":"Adam","matrix_params":"Muon"}
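
A sketch of the parameter split described in other_params above, routing matrix parameters to Muon and scalars/embeddings to Adam; the `muon` import path, its constructor arguments, and the learning rates are assumptions about whichever implementation the run used.

```python
import torch
import torch.nn as nn

def build_optimizers(model: nn.Module):
    # Route 2D matrix parameters (attention/MLP weights) to Muon and
    # everything else (scalars, norms, embeddings, head) to Adam.
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix_params.append(p)
        else:
            other_params.append(p)
    adam = torch.optim.Adam(other_params, lr=3e-4)  # lr values are placeholders
    try:
        from muon import Muon  # assumed import path for a Muon implementation
        matrix_opt = Muon(matrix_params, lr=0.02, momentum=0.95)
    except ImportError:
        # Fallback so the sketch runs without Muon installed.
        matrix_opt = torch.optim.SGD(matrix_params, lr=0.02, momentum=0.95)
    return adam, matrix_opt
```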
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
zlib
level: null
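
A sketch of the artifact packaging step implied by this entry: serialize the (quantized) state dict, then zlib-compress it. The file name is hypothetical, and `level: null` suggests the zlib default level was used.

```python
import io
import zlib

import torch

def save_artifact(state_dict: dict, path: str = "artifact.bin") -> int:
    # Serialize, then zlib-compress; zlib's default level is -1
    # (Z_DEFAULT_COMPRESSION, equivalent to level 6).
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    blob = zlib.compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)

# size = save_artifact(model.state_dict())
# assert size <= 16 * 1024 * 1024, "artifact exceeds the 16 MB cap"
```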
Novel Contributions
- Ablation study isolating six architectural components of the baseline train_gpt.py across 27 runs and 3 seeds.
- Untied embeddings (weight tying disabled) improved validation bpb over the baseline.
- Cross-hardware validation of the untied-embeddings result on both RTX PRO 6000 Blackwell and H100 SXM.
- Demonstrated that the Muon optimizer and flash attention are the largest contributors to model quality in this setup.
- Showed that int8 quantization preserves quality while keeping the artifact under the 16 MB cap.