PR #618

closed

experiments: MODEL_DIM=256, MLP_MULT=3, WARMDOWN fix - best bpb 1.4702

val_bpb

1.4702

Architecture

—

Optimizer

—

Artifact Size

6.4MB

Training Techniques

Architecture

MLP3x

Increased MLP multiplier from the default 2 to 3.

parameters: {"mlp_mult":3}
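To give a rough sense of what raising the multiplier costs at MODEL_DIM=256, here is a minimal sketch of the per-block MLP weight count. It assumes a plain two-matrix MLP (up and down projection) with no biases; the PR does not describe the actual MLP layout, and `mlp_params` is a hypothetical helper, not a function from the codebase.

```python
def mlp_params(model_dim: int, mlp_mult: int) -> int:
    """Weights in one MLP block, assuming up + down projections and no biases."""
    hidden = mlp_mult * model_dim       # hidden width scales with the multiplier
    return 2 * model_dim * hidden       # model_dim*hidden (up) + hidden*model_dim (down)


default_mlp = mlp_params(256, 2)        # assumed default multiplier of 2
mlp3x = mlp_params(256, 3)              # multiplier used in this PR
print(default_mlp, mlp3x, mlp3x / default_mlp)
```

Under these assumptions the MLP weights grow by a factor of 1.5 per block, so the added capacity has to be weighed against slower steps and a larger artifact.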

weight tying

Explored layer tying as a possible way to stay within the remaining size headroom; not yet applied in the reported best run.

parameters: null

LR Schedule

warmdown

parameters: {"warmdown_iters":null,"constraint":"must fit within actual step count"}
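The constraint above can be illustrated with a small sketch of a linear warmdown multiplier that clamps the warmdown length to the run's real step count. The PR does not show how the fix was implemented, so this is only one possible approach; `lr_scale` and `total_steps` are hypothetical names, with `total_steps` standing in for the number of steps the run actually completes.

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int) -> float:
    """Linear warmdown multiplier in [0, 1] for a given step (illustrative only)."""
    # Clamp so the warmdown never spans more steps than the run actually takes.
    warmdown = max(0, min(warmdown_iters, total_steps))
    if warmdown == 0:
        return 1.0                      # no warmdown configured
    start = total_steps - warmdown      # first step of the decay phase
    if step < start:
        return 1.0                      # full learning rate before warmdown
    return max(0.0, (total_steps - step) / warmdown)


# With a warmdown longer than the run, the clamp keeps step 0 at full rate.
print(lr_scale(0, 100, 500), lr_scale(50, 100, 500), lr_scale(100, 100, 500))
```

Without the clamp, a warmdown of 500 steps on a 100-step run would start at a multiplier of 0.2 and never reach the full learning rate.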

MODEL_DIM=256 with MLP_MULT=3 and the warmdown fix gave the best reported score: 1.4702 bpb.
Wider models lost under the time budget, and deeper models without tying performed poorly.
Under the available time budget, step speed mattered more than model size.
WARMDOWN_ITERS must fit within the actual step count.
The remaining artifact headroom leaves room to try layer tying.