PR #1029
Open
Non-record: Knowledge Distillation - A Negative Result (val_bpb=1.152)
by fielding
val_bpb
1.1520
Architecture
Transformer
Optimizer
—
Artifact Size
15.4 MB
Training Techniques
Architecture
BigramHash
Input-side bigram hashing used in the model setup; does not affect the prediction head.
parameters: {"vocab_size":6144}
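The card gives only the bigram table size (vocab_size=6144), not the hash itself. A minimal sketch of input-side bigram hashing, where each position gets an extra embedding index derived from the (previous, current) token pair; the multiplicative mixing constant is a hypothetical placeholder:

```python
def bigram_hash(prev_token: int, token: int, vocab_size: int = 6144) -> int:
    """Hash a (prev, current) token pair into a fixed-size bigram vocab.

    The PR does not specify the hash; this multiplicative mix is one
    cheap, common choice (the constant 1000003 is illustrative only).
    """
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    return h % vocab_size

# Each position past the first gets a bigram-embedding index; the
# prediction head is untouched, as the card notes.
tokens = [5, 17, 17, 256]
bigram_ids = [bigram_hash(p, t) for p, t in zip(tokens, tokens[1:])]
```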
XSA
Uses XSA in the last layers of the student model.
parameters: {"last_layers":4}
GQA
Grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
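A minimal, non-causal sketch of the GQA head-sharing mechanics with the stated heads=8, kv_heads=4 (each K/V head serves two query heads); batching and masking are omitted for brevity:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped query attention: fewer K/V heads than query heads.
    q: (heads, T, d); k, v: (kv_heads, T, d), heads % kv_heads == 0.
    Unbatched and non-causal, purely to show the head sharing."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # each K/V head serves `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa_attention(rng.standard_normal((8, T, d)),
                    rng.standard_normal((4, T, d)),
                    rng.standard_normal((4, T, d)))
```

The point of GQA is the KV-cache saving: K and V are stored for 4 heads instead of 8, then expanded on the fly.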
ReLU²
Uses relu-squared MLP activation.
parameters: null
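The ReLU² activation is simply the square of ReLU, applied elementwise in the MLP:

```python
import numpy as np

def relu_squared(x):
    """ReLU^2 MLP activation: max(x, 0) squared, elementwise."""
    return np.square(np.maximum(x, 0.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu_squared(x)
```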
Weight Averaging
EMA
parameters: {"decay":0.997}
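EMA weight averaging maintains a shadow copy of the parameters, updated each step with the stated decay of 0.997:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy example: an EMA initialized at 0 pulled toward a constant weight of 1.
ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0])
```

The EMA copy, not the raw weights, is typically what gets evaluated and shipped.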
Quantization
late QAT
bits: null
scope: model
LR Schedule
warmdown
parameters: {"warmdown_iters":1600}
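The card gives only warmdown_iters=1600, not the decay shape. One common "warmdown" form, sketched here as an assumption, holds the base LR and decays linearly to zero over the final 1600 steps:

```python
def lr_with_warmdown(step, total_iters, base_lr, warmdown_iters=1600):
    """Hold base_lr, then decay linearly to 0 over the last warmdown_iters
    steps. A common warmdown shape; the PR's exact schedule may differ."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```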
Evaluation
sliding window eval
parameters: null
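The card does not give the window or stride; a sketch of the span bookkeeping behind sliding-window evaluation (window=1024, stride=512 are hypothetical values), where each window re-reads overlapping context but only the newly covered tokens are scored, so every token is counted exactly once with near-full context:

```python
def sliding_window_spans(n_tokens, window=1024, stride=512):
    """Return (begin, end, score_from) triples: the model reads tokens
    [begin, end) but only tokens [score_from, end) contribute to the loss."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(2000)
```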
Other
other
Knowledge distillation using both hard label replacement with teacher top-1 predictions and soft KL distillation against cached teacher logits.
parameters: {"teacher_params":105500000,"top_k_logits":32,"temperature":2,"alpha_values":[0.1,0.3,0.5]}
Novel Contributions
- First distillation experiment in Parameter Golf
- Systematic comparison of hard distillation and soft KL distillation under tight step-budget constraints
- Cached top-32 teacher logits to make distillation feasible within the training budget
- Extended-training analysis showing that the distilled runs do not overtake the baseline even with a larger step budget
- Demonstration that online teacher inference is too expensive for this setting
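The caching trick in the contributions above can be sketched as storing only the top-32 teacher logits per position and computing the KL over that support. How the truncated tail is handled is not stated; renormalizing the teacher over the cached support, as below, is one common approximation:

```python
import numpy as np

def cache_topk(teacher_logits, k=32):
    """Keep only the top-k teacher logits per position (indices + values),
    shrinking the cache by roughly vocab_size / k."""
    idx = np.argpartition(teacher_logits, -k, axis=-1)[:, -k:]
    vals = np.take_along_axis(teacher_logits, idx, axis=-1)
    return idx, vals

def sparse_kl(student_logits, idx, vals, T=2.0):
    """KL(teacher || student) with the teacher renormalized over its cached
    top-k support (an approximation; the PR's tail handling may differ)."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Student log-probs over the full vocab, gathered at the cached indices.
    log_q = np.take_along_axis(log_softmax(student_logits / T), idx, axis=-1)
    log_p = log_softmax(vals / T)  # teacher renormalized over top-k support
    p = np.exp(log_p)
    return (p * (log_p - log_q)).sum(axis=-1).mean()

teacher = np.arange(100.0).reshape(1, 100)
idx, vals = cache_topk(teacher, k=32)
kl = sparse_kl(teacher, idx, vals)
```

Caching is what makes this feasible in budget: the teacher runs once offline, and training reads the cached (index, value) pairs instead of paying for online teacher inference every step.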