PR #618

closed

experiments: MODEL_DIM=256, MLP_MULT=3, WARMDOWN fix - best bpb 1.4702

by 0xtigerclaw
val_bpb: 1.4702
Architecture: (not listed)
Optimizer: (not listed)
Artifact Size: 6.4MB

Training Techniques

Architecture
MLP3x
Increased MLP multiplier from the default 2 to 3.
parameters: {"mlp_mult":3}
weight tying
Layer tying was explored as a way to use the remaining artifact-size headroom. It was not applied in the reported best run.
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_iters":null,"constraint":"must fit within actual step count"}
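
The constraint can be illustrated with a minimal sketch of a linear warmdown schedule. This is not the training script's actual code; `lr_multiplier` and `fit_warmdown` are hypothetical helpers, and the step counts are made-up values. The point is that when the warmdown window is longer than the run, the learning rate is already decayed at step 0 and never reaches its full value.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_iters: int) -> float:
    """Linear warmdown: hold 1.0, then decay to 0 over the last warmdown_iters steps."""
    if warmdown_iters <= 0:
        return 1.0
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return 1.0
    return max(remaining, 0) / warmdown_iters


def fit_warmdown(warmdown_iters: int, total_steps: int) -> int:
    """Clamp the warmdown window so it fits inside the steps actually run."""
    return min(warmdown_iters, total_steps)


ACTUAL_STEPS = 1000  # hypothetical number of steps completed in the run

# Window longer than the run: the schedule starts below full learning rate.
too_long = lr_multiplier(0, ACTUAL_STEPS, warmdown_iters=1200)

# Window that fits: full learning rate first, decay only near the end.
fits_start = lr_multiplier(0, ACTUAL_STEPS, warmdown_iters=200)
fits_late = lr_multiplier(900, ACTUAL_STEPS, warmdown_iters=200)

print(too_long, fits_start, fits_late)
print(fit_warmdown(1200, ACTUAL_STEPS))
```

Clamping is only one possible fix; choosing warmdown_iters from a measured step count would serve the same purpose.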

Novel Contributions

  • Found that MODEL_DIM=256 with MLP_MULT=3 and the warmdown fix achieved the best reported score of 1.4702 bpb.
  • Observed that wider models scored worse under the time budget, and that deeper models without tying performed poorly.
  • Identified that step speed matters more than model size under the available time budget.
  • Noted that WARMDOWN_ITERS must fit within the actual step count.
  • Suggested using the remaining artifact headroom for layer tying.
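
The step-speed observation can be read as simple arithmetic: under a fixed wall-clock budget, a slower step means fewer optimizer updates and fewer tokens seen. The sketch below uses entirely hypothetical numbers (budget, per-step times, tokens per step); none of them come from the PR.

```python
def steps_in_budget(budget_ms: int, step_ms: int) -> int:
    """Whole training steps that fit in a wall-clock budget (integer milliseconds)."""
    return budget_ms // step_ms


BUDGET_MS = 600_000       # hypothetical time budget
TOKENS_PER_STEP = 65_536  # hypothetical batch size in tokens

# Hypothetical per-step times for a smaller and a wider model.
configs = {"small": 100, "wide": 250}

for name, step_ms in configs.items():
    steps = steps_in_budget(BUDGET_MS, step_ms)
    tokens = steps * TOKENS_PER_STEP
    print(f"{name}: {steps} steps, {tokens} tokens seen")
```

With these made-up timings the smaller model gets 2.5x as many updates in the same budget, which is the kind of gap a wider model would have to overcome with better per-step learning.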