PR #1034

open

Non-record: knowledge distillation teacher-student submission

by Jeneesh1014
val_bpb
1.7195
Architecture
Transformer
Optimizer
Artifact Size
~5MB

Training Techniques

Other
other
Teacher-student knowledge distillation: a larger teacher model is trained on cross-entropy, and a smaller student is trained on a mix of hard-label cross-entropy and KL divergence to the teacher's soft predictions.
parameters: {"alpha":0.5,"temperature":4}
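The loss described above can be sketched as follows. This is a minimal NumPy illustration of standard Hinton-style distillation, not the submission's actual code: the function name, shapes, and the exact alpha-weighting convention (alpha on the hard-label term) are assumptions, and the T² scaling of the soft term is the common convention for keeping the two terms comparable.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=4.0):
    # Hard-label term: cross-entropy of the student against ground truth.
    ce = -log_softmax(student_logits)[np.arange(len(labels)), labels].mean()
    # Soft-label term: KL(teacher || student) at temperature T.
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    p_t = np.exp(log_p_t)
    # T^2 factor keeps soft-term gradient magnitudes comparable as T grows.
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1).mean() * temperature**2
    # Mix the two terms; alpha=0.5, T=4 per this submission's parameters.
    return alpha * ce + (1 - alpha) * kl
```

With alpha=0.5 and temperature=4 the two terms contribute equally, which matches the parameters reported here.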
Architecture
weight tying
Tied embeddings are enabled in the student model.
parameters: null
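Weight tying, as enabled in the student, means one matrix serves both as the input embedding table and (transposed) as the output projection. A minimal NumPy sketch, with illustrative vocab/model sizes that are not from the submission:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 256, 64  # illustrative sizes, not the submission's
# Single shared matrix: input embedding table and output projection.
W = rng.normal(scale=0.02, size=(vocab_size, d_model))

def embed(token_ids):
    return W[token_ids]       # row lookup

def lm_head(hidden):
    return hidden @ W.T       # reuse the same matrix, transposed

h = embed(np.array([1, 2, 3]))
logits = lm_head(h)           # shape (3, vocab_size)
```

The tying saves roughly vocab_size × d_model parameters, which matters under a small artifact-size budget.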
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":20}
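A linear warmup over the reported 20 steps can be sketched as below; the base learning rate is a hypothetical placeholder, and whether the schedule decays after warmup is not stated here, so this sketch simply holds flat:

```python
def warmup_lr(step, base_lr=3e-4, warmup_steps=20):
    # Linearly ramp the LR over the first warmup_steps, then hold flat.
    # base_lr is a placeholder; warmup_steps=20 matches this submission.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```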

Novel Contributions

  • Non-record teacher-student distillation submission for the 16MB track
  • Uses a larger teacher model to guide a smaller student model via KL distillation
  • Reports an honest partial-run validation score and clearly distinguishes measured results from extrapolated estimates
  • Provides reproducible 8×H100 and single-GPU smoke-test commands