PR #2000
Non-record: Ablation Study — Untied Embeddings (val_bpb 1.3302)
Status: open
by edidiongumoh
val_bpb: 1.3302
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13,017,328 bytes
Training Techniques
Architecture
weight tying
Untied input embeddings and output head by disabling embedding tying.
parameters: null
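
To make the ablation concrete, here is a minimal PyTorch sketch of tied vs. untied embeddings; the `TinyLM` module and `tie_embeddings` flag are illustrative stand-ins, not code from the PR's train_gpt.py.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    # Hypothetical stand-in for the baseline LM; `tie_embeddings` mirrors
    # the flag this ablation disables.
    def __init__(self, vocab_size: int, d_model: int, tie_embeddings: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if tie_embeddings:
            # Tied (baseline): input embedding and output head share one matrix.
            self.lm_head.weight = self.embed.weight
        # Untied (this PR's variant): the two matrices train independently,
        # at the cost of vocab_size * d_model extra parameters.

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.embed(idx))

model = TinyLM(vocab_size=50304, d_model=768, tie_embeddings=False)
```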
GQA
Used grouped query attention with 4 KV heads in the baseline; ablated to MHA with 8 KV heads.
parameters: {"num_kv_heads":4,"num_heads":8}
KV head count
Ablated KV head sharing from GQA to MHA.
parameters: {"num_kv_heads":8}
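
A sketch of how the GQA-to-MHA ablation maps onto PyTorch's scaled_dot_product_attention; the tensor shapes and the `repeat_interleave` KV expansion are illustrative assumptions, not the PR's implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, num_heads=8, num_kv_heads=4):
    # q: (B, num_heads, T, D); k, v: (B, num_kv_heads, T, D).
    # With num_kv_heads == num_heads this is plain MHA (the ablated config).
    groups = num_heads // num_kv_heads
    if groups > 1:
        # GQA: each KV head serves `groups` query heads.
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, T, D = 2, 16, 64
q = torch.randn(B, 8, T, D)
k, v = torch.randn(B, 4, T, D), torch.randn(B, 4, T, D)
out_gqa = attention(q, k, v)                    # baseline: 4 KV heads
k8, v8 = torch.randn(B, 8, T, D), torch.randn(B, 8, T, D)
out_mha = attention(q, k8, v8, num_kv_heads=8)  # ablation: 8 KV heads
```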
attention
Flash-attention SDPA backend used in the baseline; compared against the math kernel.
parameters: {"backend":"flash"}
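
This comparison can be reproduced with PyTorch's SDPA kernel selector (`torch.nn.attention.sdpa_kernel`, available since roughly PyTorch 2.3); the shapes below are arbitrary and a CUDA device is assumed for the flash kernel.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Baseline: force the fused flash-attention kernel.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Ablation: force the unfused reference ("math") kernel.
with sdpa_kernel(SDPBackend.MATH):
    out_math = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Both compute the same function; they differ in speed and memory,
# which changes how far training gets within a fixed budget.
torch.testing.assert_close(out_flash, out_math, atol=1e-2, rtol=1e-2)
```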
Quantization
int8
bits: 8
scope: all
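
A generic sketch of symmetric per-tensor int8 quantization; the PR's exact scheme (per-tensor vs. per-channel, where scales are stored) is not specified here, so treat the details as assumptions.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = w.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, s = quantize_int8(w)
max_err = (dequantize_int8(q, s) - w).abs().max()
# int8 stores 1 byte/param vs. 4 for fp32: roughly 4x smaller before
# compression, which is what keeps the artifact under the 16 MB cap.
```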
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"scalars_and_embeddings":"Adam","matrix_params":"Muon"}
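
A sketch of the parameter split described in other_params above, routing matrix parameters to Muon and scalars/embeddings to Adam; the `muon` import path, its constructor arguments, and the learning rates are assumptions about whichever implementation the run used.

```python
import torch
import torch.nn as nn

def build_optimizers(model: nn.Module):
    # Route 2D matrix parameters (attention/MLP weights) to Muon and
    # everything else (scalars, norms, embeddings, head) to Adam.
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix_params.append(p)
        else:
            other_params.append(p)
    adam = torch.optim.Adam(other_params, lr=3e-4)  # lr values are placeholders
    try:
        from muon import Muon  # assumed import path for a Muon implementation
        matrix_opt = Muon(matrix_params, lr=0.02, momentum=0.95)
    except ImportError:
        # Fallback so the sketch runs without Muon installed.
        matrix_opt = torch.optim.SGD(matrix_params, lr=0.02, momentum=0.95)
    return adam, matrix_opt
```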
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Compression
zlib
level: null
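
A sketch of the artifact packaging step implied by this entry: serialize the (quantized) state dict, then zlib-compress it. The file name is hypothetical, and `level: null` suggests the zlib default level was used.

```python
import io
import zlib

import torch

def save_artifact(state_dict: dict, path: str = "artifact.bin") -> int:
    # Serialize, then zlib-compress; zlib's default level is -1
    # (Z_DEFAULT_COMPRESSION, equivalent to level 6).
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    blob = zlib.compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)

# size = save_artifact(model.state_dict())
# assert size <= 16 * 1024 * 1024, "artifact exceeds the 16 MB cap"
```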
Novel Contributions
- Ablation study isolating six architectural components of the baseline train_gpt.py across 27 runs and 3 seeds.
- Untied embeddings (weight tying disabled) improved validation bpb over the baseline.
- Cross-hardware validation of the untied-embeddings result on both RTX PRO 6000 Blackwell and H100 SXM.
- Demonstrated that the Muon optimizer and flash attention are the largest contributors to model quality in this setup.
- Showed that int8 quantization preserves quality while keeping the artifact under the 16 MB cap.