val_bpb: 1.2197
Architecture: Transformer
Optimizer: —
Artifact Size: 15.90 MB
Training Techniques
Quantization: fp16
bits: 16
scope: tied embeddings / output head
Architecture: tied embeddings
Kept the tied token embedding in fp16 during export because it also serves as the output head, reducing quantization loss.
parameters: {"tie_embeddings": 1}
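A minimal sketch of this export step, assuming per-tensor symmetric int8 quantization for the remaining matrices and name matching on "embed"/"lm_head" (neither detail is stated in the report):

```python
import torch

def export_quantized(state_dict, keep_fp16=("embed", "lm_head")):
    """Quantize 2-D weight matrices to int8, but keep the tied embedding /
    output head (and 1-D params) in fp16 to avoid quantization loss at the logits."""
    out = {}
    for name, w in state_dict.items():
        if any(key in name for key in keep_fp16) or w.ndim < 2:
            # Tied embedding / output head (and 1-D params) stay in fp16.
            out[name] = w.to(torch.float16)
        else:
            # Symmetric per-tensor int8: scale chosen so max |w| maps to 127.
            scale = w.abs().max().clamp(min=1e-8) / 127.0
            q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
            out[name] = {"q": q, "scale": scale.to(torch.float16)}
    return out
```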
MLP hidden size
Reduced the MLP hidden dimension from 1024 to 992 to fit under the 16 MB artifact limit.
parameters: {"mlp_hidden": 992}
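For illustration, the size accounting behind that choice can be sketched as a function of mlp_hidden; every dimension below other than mlp_hidden is a hypothetical placeholder, since the report only gives the 15.90 MB artifact and mlp_hidden=992:

```python
# Hypothetical dimensions for illustration only; the report does not state
# d_model, depth, or vocabulary size.
d_model, n_layers, vocab = 384, 8, 32768

def artifact_mb(mlp_hidden, matrix_bytes=1, embed_bytes=2):
    """Rough artifact size: int8 (1 byte) matrices plus the fp16 (2 byte)
    tied embedding / output head; biases, norms, and metadata are ignored."""
    attn = n_layers * 4 * d_model * d_model       # Q, K, V, output projections
    mlp = n_layers * 2 * d_model * mlp_hidden     # up and down projections
    embed = vocab * d_model                       # tied embedding = output head
    return (matrix_bytes * (attn + mlp) + embed_bytes * embed) / 2**20

print(artifact_mb(1024), artifact_mb(992))  # shrinking mlp_hidden trims the budget
```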
LR Schedule: warmdown
Extended the warmdown phase from 1200 to 3600 steps.
parameters: {"warmdown_steps": 3600}
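A minimal sketch of such a warmdown schedule, assuming the common constant-then-linear-decay-to-zero shape; only warmdown_steps=3600 comes from the report:

```python
def lr_scale(step, total_steps, warmdown_steps=3600):
    # Constant LR for most of training, then decay linearly to zero over the
    # final warmdown_steps. The constant-then-linear shape is an assumption.
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```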
Other
Increased the matrix learning rate from 0.04 to 0.06 to better match the short 10-minute training budget.
parameters: {"matrix_lr": 0.06}
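One way such a per-group learning rate could be wired up; the use of AdamW and the non-matrix learning rate are assumptions, since the report leaves the optimizer field blank and only states matrix_lr=0.06:

```python
import torch

def build_optimizer(model, matrix_lr=0.06, other_lr=0.008):
    # Give 2-D weight matrices their own (higher) learning rate; other_lr and
    # AdamW itself are placeholders, not values from the report.
    matrix_params = [p for p in model.parameters() if p.ndim >= 2]
    other_params = [p for p in model.parameters() if p.ndim < 2]
    return torch.optim.AdamW([
        {"params": matrix_params, "lr": matrix_lr},
        {"params": other_params, "lr": other_lr},
    ])
```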
Novel Contributions
- Kept the tied embedding in fp16 during export instead of int8 quantizing it.
- Reduced quantization gap from about 0.007 BPB to about 0.0005 BPB.
- Shrank the MLP hidden size from 1024 to 992 to stay under the 16 MB limit.
- Tuned warmdown from 1200 to 3600 steps.
- Increased matrix learning rate from 0.04 to 0.06.
- Observed that leaving NCCL_IB_DISABLE unset or set to 0 (i.e., keeping InfiniBand enabled) improves throughput on IB/NVLink pods; see the sketch after this list.
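A hedged sketch of that environment setup; NCCL_IB_DISABLE=0 (or unset) lets NCCL use the InfiniBand transport, and the variable must be in place before the first NCCL/distributed initialization. How the launcher actually sets it is not stated in the report:

```python
import os

# Keep InfiniBand enabled on IB/NVLink pods: NCCL_IB_DISABLE=0 (or unset)
# lets NCCL use the IB transport. Must run before torch.distributed / NCCL
# is initialized in the training process.
os.environ.setdefault("NCCL_IB_DISABLE", "0")
```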