PR #1029
Open
Non-record: Knowledge Distillation - A Negative Result (val_bpb=1.152)
by fielding
val_bpb
1.1520
Architecture
Transformer
Optimizer
—
Artifact Size
15.4 MB
Training Techniques
Architecture
BigramHash
Input-side bigram hashing used in the model setup; does not affect the prediction head.
parameters: {"vocab_size":6144}
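The card gives only the bigram table size (vocab_size=6144), not the hash itself. A minimal sketch of input-side bigram hashing, where each position gets an extra embedding index derived from the (previous, current) token pair; the multiplicative mixing constant is a hypothetical placeholder:

```python
def bigram_hash(prev_token: int, token: int, vocab_size: int = 6144) -> int:
    """Hash a (prev, current) token pair into a fixed-size bigram vocab.

    The PR does not specify the hash; this multiplicative mix is one
    cheap, common choice (the constant 1000003 is illustrative only).
    """
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    return h % vocab_size

# Each position past the first gets a bigram-embedding index; the
# prediction head is untouched, as the card notes.
tokens = [5, 17, 17, 256]
bigram_ids = [bigram_hash(p, t) for p, t in zip(tokens, tokens[1:])]
```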
XSA
Uses XSA in the last layers of the student model.
parameters: {"last_layers":4}
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
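A minimal, non-causal sketch of the GQA head-sharing mechanics with the stated heads=8, kv_heads=4 (each K/V head serves two query heads); batching and masking are omitted for brevity:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped query attention: fewer K/V heads than query heads.
    q: (heads, T, d); k, v: (kv_heads, T, d), heads % kv_heads == 0.
    Unbatched and non-causal, purely to show the head sharing."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # each K/V head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa_attention(rng.standard_normal((8, T, d)),
                    rng.standard_normal((4, T, d)),
                    rng.standard_normal((4, T, d)))
```

The point of GQA is the KV-cache saving: K and V are stored for 4 heads instead of 8, then expanded on the fly.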
ReLU²
Uses relu-squared MLP activation.
parameters: null
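The ReLU² activation is simply the square of ReLU, applied elementwise in the MLP:

```python
import numpy as np

def relu_squared(x):
    """ReLU^2 MLP activation: max(x, 0) squared, elementwise."""
    return np.square(np.maximum(x, 0.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu_squared(x)
```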
Weight Averaging
EMA
parameters: {"decay":0.997}
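EMA weight averaging maintains a shadow copy of the parameters, updated each step with the stated decay of 0.997:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy example: an EMA initialized at 0 pulled toward a constant weight of 1.
ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0])
```

The EMA copy, not the raw weights, is typically what gets evaluated and shipped.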
Quantization
late QAT
bits: null
scope: model
LR Schedule
warmdown
parameters: {"warmdown_iters":1600}
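The card gives only warmdown_iters=1600, not the decay shape. One common "warmdown" form, sketched here as an assumption, holds the base LR and decays linearly to zero over the final 1600 steps:

```python
def lr_with_warmdown(step, total_iters, base_lr, warmdown_iters=1600):
    """Hold base_lr, then decay linearly to 0 over the last warmdown_iters
    steps. A common warmdown shape; the PR's exact schedule may differ."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```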
Evaluation
sliding window eval
parameters: null
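The card does not give the window or stride; a sketch of the span bookkeeping behind sliding-window evaluation (window=1024, stride=512 are hypothetical values), where each window re-reads overlapping context but only the newly covered tokens are scored, so every token is counted exactly once with near-full context:

```python
def sliding_window_spans(n_tokens, window=1024, stride=512):
    """Return (begin, end, score_from) triples: the model reads tokens
    [begin, end) but only tokens [score_from, end) contribute to the loss."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(2000)
```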
Other
other
Knowledge distillation using both hard label replacement with teacher top-1 predictions and soft KL distillation against cached teacher logits.
parameters: {"teacher_params":105500000,"top_k_logits":32,"temperature":2,"alpha_values":[0.1,0.3,0.5]}
Novel Contributions
- First distillation experiment in Parameter Golf
- Systematic comparison of hard distillation and soft KL distillation under tight step-budget constraints
- Cached top-32 teacher logits to make distillation feasible within the training budget
- Extended-training analysis showing that the distilled runs do not overtake the baseline even with a larger step budget
- Demonstration that online teacher inference is too expensive for this setting
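The caching trick in the contributions above can be sketched as storing only the top-32 teacher logits per position and computing the KL over that support. How the truncated tail is handled is not stated; renormalizing the teacher over the cached support, as below, is one common approximation:

```python
import numpy as np

def cache_topk(teacher_logits, k=32):
    """Keep only the top-k teacher logits per position (indices + values),
    shrinking the cache by roughly vocab_size / k."""
    idx = np.argpartition(teacher_logits, -k, axis=-1)[:, -k:]
    vals = np.take_along_axis(teacher_logits, idx, axis=-1)
    return idx, vals

def sparse_kl(student_logits, idx, vals, T=2.0):
    """KL(teacher || student) with the teacher renormalized over its cached
    top-k support (an approximation; the PR's tail handling may differ)."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Student log-probs over the full vocab, gathered at the cached indices.
    log_q = np.take_along_axis(log_softmax(student_logits / T), idx, axis=-1)
    log_p = log_softmax(vals / T)  # teacher renormalized over top-k support
    p = np.exp(log_p)
    return (p * (log_p - log_q)).sum(axis=-1).mean()

teacher = np.arange(100.0).reshape(1, 100)
idx, vals = cache_topk(teacher, k=32)
kl = sparse_kl(teacher, idx, vals)
```

Caching is what makes this feasible in budget: the teacher runs once offline, and training reads the cached (index, value) pairs instead of paying for online teacher inference every step.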