PR #42

closed

fp16 tied embedding + warmdown/LR tuning (val_bpb 1.2197)

by chonchiog
val_bpb
1.2197
Architecture
Transformer
Optimizer
Artifact Size
15.90MB

Training Techniques

Quantization
fp16
bits: 16
scope: tied embeddings / output head
Architecture
tied embeddings
Kept the tied token embedding in fp16 because it also serves as the output head, reducing quantization loss.
parameters: {"tie_embeddings":1}
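The fp16/int8 split described above can be sketched as follows; the tensor names, the per-tensor int8 scheme, and the `export_state_dict` helper are illustrative assumptions, not the PR's actual code.

```python
import numpy as np

def export_state_dict(state_dict, embed_key="embed.weight"):
    """Quantize each tensor to int8 with a per-tensor scale, but keep
    the tied embedding (which doubles as the output head) in fp16.
    Names and the int8 scheme are hypothetical."""
    out = {}
    for name, w in state_dict.items():
        if name == embed_key:
            # fp16 preserves output-head logits far better than int8
            out[name] = w.astype(np.float16)
        else:
            scale = np.abs(w).max() / 127.0
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            out[name] = (q, np.float32(scale))
    return out
```

Keeping only the embedding in fp16 costs one extra byte per embedding parameter relative to int8, which is why the MLP width then has to give a little ground under the artifact cap.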
MLP hidden size
Reduced MLP hidden dimension to fit under the 16MB artifact limit.
parameters: {"mlp_hidden":992}
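A back-of-the-envelope size check shows why a small trim to the MLP width matters near the 16MB cap; only `mlp_hidden` comes from the PR, and the layer shapes below are an assumed simplification of the architecture.

```python
def artifact_mb(vocab, d_model, mlp_hidden, n_layers):
    """Rough export size: tied embedding in fp16 (2 bytes/param),
    everything else int8 (1 byte/param). Shapes are assumptions."""
    embed = vocab * d_model * 2                 # fp16 tied embedding/head
    attn = n_layers * 4 * d_model * d_model     # q, k, v, o projections, int8
    mlp = n_layers * 2 * d_model * mlp_hidden   # up + down projections, int8
    return (embed + attn + mlp) / 2**20

# Shrinking mlp_hidden from 1024 to 992 saves n_layers * 2 * d_model * 32
# bytes in this accounting, enough to duck back under the limit.
```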
LR Schedule
warmdown
parameters: {"warmdown_steps":3600}
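A warmdown of this shape, constant LR followed by a linear decay to zero over the final `warmdown_steps`, might look like the sketch below; `warmdown_steps=3600` and the base LR of 0.06 come from the PR, while the total step count is an assumption for illustration.

```python
def lr_at(step, base_lr=0.06, total_steps=6000, warmdown_steps=3600):
    """Constant LR, then a linear 'warmdown' to zero over the last
    warmdown_steps. total_steps is an assumed value."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```

Stretching the warmdown from 1200 to 3600 steps starts the decay earlier, so a larger fraction of the short run is spent at a shrinking LR.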
Other
other
Increased matrix learning rate to better match the short 10-minute training budget.
parameters: {"matrix_lr":0.06}

Novel Contributions

  • Kept the tied embedding in fp16 during export instead of int8 quantizing it.
  • Reduced quantization gap from about 0.007 BPB to about 0.0005 BPB.
  • Shrank MLP hidden size from 1024 to 992 to stay under the 16MB limit.
  • Tuned warmdown from 1200 to 3600 steps.
  • Increased matrix learning rate from 0.04 to 0.06.
  • Observed that setting NCCL_IB_DISABLE=0 (i.e. leaving InfiniBand enabled) improves throughput on IB/NVLink pods.
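The first two bullets can be sanity-checked without a full eval: fp16's round-trip error on a weight matrix sits orders of magnitude below per-tensor int8's, which is consistent with the gap shrinking from ~0.007 to ~0.0005 BPB. The weight distribution below is a synthetic stand-in, not the PR's checkpoint.

```python
import numpy as np

def roundtrip_err(w, mode):
    """Mean absolute error after quantizing and dequantizing w."""
    if mode == "fp16":
        back = w.astype(np.float16).astype(np.float32)
    else:  # per-tensor int8, a common simple scheme
        scale = np.abs(w).max() / 127.0
        back = np.clip(np.round(w / scale), -127, 127) * scale
    return float(np.abs(w - back).mean())

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, (1000, 256)).astype(np.float32)  # synthetic weights
```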