PR #1034
open
Non-record: knowledge distillation teacher-student submission
by Jeneesh1014
val_bpb
1.7195
Architecture
Transformer
Optimizer
—
Artifact Size
~5MB
Training Techniques
Other
other
Teacher-student knowledge distillation: a larger teacher model is trained on cross-entropy, and a smaller student is trained on a mix of label cross-entropy and KL divergence to the teacher's soft predictions.
parameters: {"alpha":0.5,"temperature":4}
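The blended objective described above can be sketched as follows. This is a minimal illustration of the standard distillation loss with the submission's stated parameters (alpha=0.5, temperature=4), not the submission's actual code; function names are hypothetical, and the T² scaling of the KL term is the common convention from the distillation literature, assumed rather than confirmed here.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=4.0):
    # Hard-label cross-entropy against the true token id.
    ce = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 (common convention) to keep the soft-target term's
    # gradient magnitude comparable across temperatures.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_t, p_s)) * temperature ** 2
    # alpha blends the hard-label and soft-target terms.
    return alpha * ce + (1.0 - alpha) * kl
```

With alpha=0.5 the student weights the teacher's soft distribution and the ground-truth label equally; when the student matches the teacher exactly, the KL term vanishes and only the label cross-entropy remains.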
Architecture
weight tying
Tied embeddings are enabled in the student model.
parameters: null
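Weight tying means the student's input embedding table is reused as its output projection, so those parameters are stored once. A minimal sketch of the idea (class and method names are hypothetical, not from the submission):

```python
import random

class TiedLM:
    # Minimal sketch of weight tying: the same embedding table serves as
    # both the input token lookup and the output projection, so the
    # vocab x d_model matrix is stored (and trained) only once.
    def __init__(self, vocab_size, d_model):
        self.embedding = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
                          for _ in range(vocab_size)]

    def embed(self, token_id):
        # Input side: look up the token's embedding row.
        return self.embedding[token_id]

    def output_logits(self, hidden):
        # Output side: reuse the same rows as the projection,
        # i.e. logit_v = hidden . embedding[v].
        return [sum(h * e for h, e in zip(hidden, row))
                for row in self.embedding]
```

For a small model this is a meaningful parameter saving, which matters under a tight artifact-size budget like the ~5MB reported here.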
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":20}
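A warmup schedule with warmup_steps=20 ramps the learning rate up over the first 20 steps. Linear warmup is the most common form; the sketch below assumes linear ramp and a flat rate afterwards, since the submission specifies no decay:

```python
def warmup_lr(step, base_lr, warmup_steps=20):
    # Linear warmup: scale the learning rate from base_lr/warmup_steps
    # up to base_lr over the first warmup_steps optimizer steps,
    # then hold it constant (no decay is specified in this submission).
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Warming up avoids large, noisy updates while the freshly initialized student (and its view of the teacher's soft targets) is still far from calibrated.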
Novel Contributions
- Non-record teacher-student distillation submission for the 16MB track
- Uses a larger teacher model to guide a smaller student model via KL distillation
- Reports an honest partial-run validation score and clearly distinguishes measured results from extrapolated estimates
- Provides reproducible 8×H100 and single-GPU smoke-test commands