PR #2000

open

Non-record: Ablation Study — Untied Embeddings (val_bpb 1.3302)

by edidiongumoh
val_bpb
1.3302
Architecture
Transformer
Optimizer
Muon
Artifact Size
13,017,328 bytes

Training Techniques

Architecture
weight tying
Untied input embeddings and output head by disabling embedding tying.
parameters: null
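The tying mechanism this ablation disables can be sketched as follows. This is an illustrative stdlib-only sketch, not the PR's actual `train_gpt.py` code: tying means the input embedding and output head are literally the same parameter object, while untying gives each its own independent storage.

```python
def make_embeddings(vocab_size, dim, tie_weights):
    """Return (embedding, head) weight matrices; shared object when tied."""
    embedding = [[0.0] * dim for _ in range(vocab_size)]
    if tie_weights:
        head = embedding                       # tied: one shared parameter
    else:
        head = [row[:] for row in embedding]  # untied: independent copy
    return embedding, head

tied_emb, tied_head = make_embeddings(8, 4, tie_weights=True)
untied_emb, untied_head = make_embeddings(8, 4, tie_weights=False)
```

Untying roughly doubles the embedding parameter count, which is why the artifact-size budget matters for this ablation.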
GQA
Used grouped query attention with 4 KV heads in the baseline; ablated to MHA with 8 KV heads.
parameters: {"num_kv_heads":4,"num_heads":8}
KV head count
Ablated KV head sharing from GQA to MHA.
parameters: {"num_kv_heads":8}
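The GQA-to-MHA ablation above amounts to changing how query heads share KV heads. A minimal sketch of the grouping rule, using the parameter values listed (8 query heads, 4 KV heads in the baseline; `num_kv_heads == num_heads` recovers MHA) — the helper name is hypothetical:

```python
def kv_head_for(query_head, num_heads, num_kv_heads):
    """Map a query head index to the KV head it attends with (GQA grouping)."""
    assert num_heads % num_kv_heads == 0
    group_size = num_heads // num_kv_heads
    return query_head // group_size

# Baseline GQA: 8 query heads share 4 KV heads (groups of 2).
gqa = [kv_head_for(q, num_heads=8, num_kv_heads=4) for q in range(8)]
# Ablated MHA: 8 KV heads, one per query head.
mha = [kv_head_for(q, num_heads=8, num_kv_heads=8) for q in range(8)]
```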
attention
Flash attention (SDPA flash backend) used in the baseline; compared against the math kernel.
parameters: {"backend":"flash"}
Quantization
int8
bits: 8
scope: all
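One common way to realize 8-bit quantization with `scope: all` is symmetric absmax quantization, sketched below. This is an assumed scheme for illustration; the PR does not specify its exact quantizer. Each weight is stored as an int8 value `q` with a shared float scale, so `w ≈ scale * q`.

```python
def quantize_int8(weights):
    """Symmetric absmax int8 quantization: w ~= scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # 1.0 guards all-zero input
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [scale * v for v in q]

w = [0.5, -1.27, 0.01, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
```

Round-to-nearest keeps the per-weight error within half a quantization step, which is why quality can survive the 4x size reduction noted in the contributions below.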
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"scalars_and_embeddings":"Adam","matrix_params":"Muon"}
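The optimizer entry above routes matrix parameters to Muon and scalars/embeddings to Adam. A hypothetical grouping rule keyed on tensor rank and parameter name (the naming convention here is an assumption, not taken from the PR):

```python
def split_param_groups(named_shapes):
    """Route params: 2-D+ matrices to Muon; scalars, vectors, and
    embedding/head weights to Adam (name-based routing is hypothetical)."""
    muon, adam = [], []
    for name, shape in named_shapes.items():
        if len(shape) >= 2 and "embed" not in name and "head" not in name:
            muon.append(name)
        else:
            adam.append(name)
    return muon, adam

groups = split_param_groups({
    "embed.weight": (50257, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.norm.scale": (768,),
    "lm_head.weight": (50257, 768),
})
```

Muon's orthogonalized updates are defined for matrices, so embeddings and norm scales fall back to Adam even when they happen to be 2-D.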
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
zlib
level: null
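The int8 + zlib pipeline that keeps the artifact under the 16 MB cap can be sketched as below. The PR's actual serialization format and zlib level are not recorded here (`level: null` above); `level=9` is an assumed example default.

```python
import struct
import zlib

def pack_artifact(int8_values, level=9):
    """Pack int8 weight codes into bytes and zlib-compress them (sketch)."""
    raw = struct.pack(f"{len(int8_values)}b", *int8_values)
    return zlib.compress(raw, level)

# Synthetic int8 codes in [-127, 127]; real artifacts compress less well.
values = [(i * 37) % 255 - 127 for i in range(10_000)]
blob = pack_artifact(values)
```

Quantizing to one byte per weight and then compressing is lossless on top of the quantization step: `zlib.decompress` recovers the int8 codes exactly.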

Novel Contributions

  • Ablation study isolating six architectural components of the baseline train_gpt.py across 27 runs and 3 seeds.
  • Untied embeddings (weight tying disabled) improved validation bpb over the baseline.
  • Cross-hardware validation of the untied-embeddings result on both RTX PRO 6000 Blackwell and H100 SXM.
  • Demonstrated that the Muon optimizer and flash attention are the largest contributors to model quality in this setup.
  • Showed that int8 quantization preserves quality while keeping the artifact under the 16 MB cap.