PR #1034

open

Non-record: knowledge distillation teacher-student submission

by Jeneesh1014
val_bpb
1.7195
Architecture
Transformer
Optimizer
Artifact Size
~5MB

Training Techniques

Other
other
Teacher-student knowledge distillation: a larger teacher model is trained on cross-entropy, and a smaller student is trained on a mix of hard-label cross-entropy and KL divergence to the teacher's soft predictions.
parameters: {"alpha":0.5,"temperature":4}
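The loss described above can be sketched as follows. This is a minimal NumPy illustration of standard Hinton-style distillation, not the submission's actual code: the function name, shapes, and the exact alpha-weighting convention (alpha on the hard-label term) are assumptions, and the T² scaling of the soft term is the common convention for keeping the two terms comparable.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=4.0):
    # Hard-label term: cross-entropy of the student against ground truth.
    ce = -log_softmax(student_logits)[np.arange(len(labels)), labels].mean()
    # Soft-label term: KL(teacher || student) at temperature T.
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    p_t = np.exp(log_p_t)
    # T^2 factor keeps soft-term gradient magnitudes comparable as T grows.
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1).mean() * temperature**2
    # Mix the two terms; alpha=0.5, T=4 per this submission's parameters.
    return alpha * ce + (1 - alpha) * kl
```

With alpha=0.5 and temperature=4 the two terms contribute equally, which matches the parameters reported here.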
Architecture
weight tying
Tied embeddings are enabled in the student model.
parameters: null
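Weight tying, as enabled in the student, means one matrix serves both as the input embedding table and (transposed) as the output projection. A minimal NumPy sketch, with illustrative vocab/model sizes that are not from the submission:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 256, 64  # illustrative sizes, not the submission's
# Single shared matrix: input embedding table and output projection.
W = rng.normal(scale=0.02, size=(vocab_size, d_model))

def embed(token_ids):
    return W[token_ids]       # row lookup

def lm_head(hidden):
    return hidden @ W.T       # reuse the same matrix, transposed

h = embed(np.array([1, 2, 3]))
logits = lm_head(h)           # shape (3, vocab_size)
```

The tying saves roughly vocab_size × d_model parameters, which matters under a small artifact-size budget.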
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmup
parameters: {"warmup_steps":20}
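A linear warmup over the reported 20 steps can be sketched as below; the base learning rate is a hypothetical placeholder, and whether the schedule decays after warmup is not stated here, so this sketch simply holds flat:

```python
def warmup_lr(step, base_lr=3e-4, warmup_steps=20):
    # Linearly ramp the LR over the first warmup_steps, then hold flat.
    # base_lr is a placeholder; warmup_steps=20 matches this submission.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```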

Novel Contributions

  • Non-record teacher-student distillation submission for the 16MB track
  • Uses a larger teacher model to guide a smaller student model via KL distillation
  • Reports an honest partial-run validation score and clearly distinguishes measured results from extrapolated estimates
  • Provides reproducible 8×H100 and single-GPU smoke-test commands