val_bpb: 1.2197
Architecture: Transformer
Optimizer: —
Artifact Size: 15.90 MB
Training Techniques
Quantization: fp16
bits: 16
scope: tied embeddings / output head
Architecture: tied embeddings
Kept the tied token embedding in fp16 during export because it also serves as the output head, reducing quantization loss.
parameters: {"tie_embeddings": 1}
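A minimal sketch of this export step, assuming per-tensor symmetric int8 quantization for the remaining matrices and name matching on "embed"/"lm_head" (neither detail is stated in the report):

```python
import torch

def export_quantized(state_dict, keep_fp16=("embed", "lm_head")):
    """Quantize 2-D weight matrices to int8, but keep the tied embedding /
    output head (and 1-D params) in fp16 to avoid quantization loss at the logits."""
    out = {}
    for name, w in state_dict.items():
        if any(key in name for key in keep_fp16) or w.ndim < 2:
            # Tied embedding / output head (and 1-D params) stay in fp16.
            out[name] = w.to(torch.float16)
        else:
            # Symmetric per-tensor int8: scale chosen so max |w| maps to 127.
            scale = w.abs().max().clamp(min=1e-8) / 127.0
            q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
            out[name] = {"q": q, "scale": scale.to(torch.float16)}
    return out
```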
MLP hidden size
Reduced the MLP hidden dimension from 1024 to 992 to fit under the 16 MB artifact limit.
parameters: {"mlp_hidden": 992}
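For illustration, the size accounting behind that choice can be sketched as a function of mlp_hidden; every dimension below other than mlp_hidden is a hypothetical placeholder, since the report only gives the 15.90 MB artifact and mlp_hidden=992:

```python
# Hypothetical dimensions for illustration only; the report does not state
# d_model, depth, or vocabulary size.
d_model, n_layers, vocab = 384, 8, 32768

def artifact_mb(mlp_hidden, matrix_bytes=1, embed_bytes=2):
    """Rough artifact size: int8 (1 byte) matrices plus the fp16 (2 byte)
    tied embedding / output head; biases, norms, and metadata are ignored."""
    attn = n_layers * 4 * d_model * d_model       # Q, K, V, output projections
    mlp = n_layers * 2 * d_model * mlp_hidden     # up and down projections
    embed = vocab * d_model                       # tied embedding = output head
    return (matrix_bytes * (attn + mlp) + embed_bytes * embed) / 2**20

print(artifact_mb(1024), artifact_mb(992))  # shrinking mlp_hidden trims the budget
```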
LR Schedule: warmdown
Extended the warmdown phase from 1200 to 3600 steps.
parameters: {"warmdown_steps": 3600}
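A minimal sketch of such a warmdown schedule, assuming the common constant-then-linear-decay-to-zero shape; only warmdown_steps=3600 comes from the report:

```python
def lr_scale(step, total_steps, warmdown_steps=3600):
    # Constant LR for most of training, then decay linearly to zero over the
    # final warmdown_steps. The constant-then-linear shape is an assumption.
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```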
Other
Increased the matrix learning rate from 0.04 to 0.06 to better match the short 10-minute training budget.
parameters: {"matrix_lr": 0.06}
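One way such a per-group learning rate could be wired up; the use of AdamW and the non-matrix learning rate are assumptions, since the report leaves the optimizer field blank and only states matrix_lr=0.06:

```python
import torch

def build_optimizer(model, matrix_lr=0.06, other_lr=0.008):
    # Give 2-D weight matrices their own (higher) learning rate; other_lr and
    # AdamW itself are placeholders, not values from the report.
    matrix_params = [p for p in model.parameters() if p.ndim >= 2]
    other_params = [p for p in model.parameters() if p.ndim < 2]
    return torch.optim.AdamW([
        {"params": matrix_params, "lr": matrix_lr},
        {"params": other_params, "lr": other_lr},
    ])
```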
Novel Contributions
- Kept the tied embedding in fp16 during export instead of int8 quantizing it.
- Reduced quantization gap from about 0.007 BPB to about 0.0005 BPB.
- Shrank the MLP hidden size from 1024 to 992 to stay under the 16 MB limit.
- Tuned warmdown from 1200 to 3600 steps.
- Increased matrix learning rate from 0.04 to 0.06.
- Observed that leaving NCCL_IB_DISABLE unset or set to 0 (i.e., keeping InfiniBand enabled) improves throughput on IB/NVLink pods; see the sketch after this list.
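A hedged sketch of that environment setup; NCCL_IB_DISABLE=0 (or unset) lets NCCL use the InfiniBand transport, and the variable must be in place before the first NCCL/distributed initialization. How the launcher actually sets it is not stated in the report:

```python
import os

# Keep InfiniBand enabled on IB/NVLink pods: NCCL_IB_DISABLE=0 (or unset)
# lets NCCL use the IB transport. Must run before torch.distributed / NCCL
# is initialized in the training process.
os.environ.setdefault("NCCL_IB_DISABLE", "0")
```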