PR #618
closed
experiments: MODEL_DIM=256, MLP_MULT=3, WARMDOWN fix - best bpb 1.4702
by 0xtigerclaw
val_bpb
1.4702
Architecture
—
Optimizer
—
Artifact Size
6.4MB
Training Techniques
Architecture
MLP3x
Increased MLP multiplier from the default 2 to 3.
parameters: {"mlp_mult":3}
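To give a rough sense of what raising the multiplier costs, here is a minimal sketch that counts MLP weights per block. It assumes the MLP is two bias-free linear layers (MODEL_DIM to MLP_MULT * MODEL_DIM and back); the PR does not state the exact layer layout, and the helper name `mlp_params` is hypothetical.

```python
def mlp_params(model_dim: int, mlp_mult: int) -> int:
    """Weight count of one MLP block, assuming two bias-free linear layers."""
    hidden = mlp_mult * model_dim
    up_proj = model_dim * hidden    # model_dim -> hidden
    down_proj = hidden * model_dim  # hidden -> model_dim
    return up_proj + down_proj


if __name__ == "__main__":
    default = mlp_params(256, 2)
    widened = mlp_params(256, 3)
    print(default, widened, widened / default)
```

Under this assumption, going from a multiplier of 2 to 3 at MODEL_DIM=256 adds 50% more MLP weights per block, which is the cost side of the trade-off against step speed.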
weight tying
Explored layer tying as a possible way to stay within the remaining size headroom; not yet applied in the reported best run.
parameters: null
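As an illustration of why layer tying is attractive under a size limit, the sketch below maps several layers onto a smaller set of shared blocks. It is not taken from the PR: the cyclic sharing pattern and the names `tied_layer_map` and `stored_block_params` are assumptions chosen for illustration.

```python
def tied_layer_map(num_layers: int, num_unique: int) -> list[int]:
    """Index of the shared block used by each layer (cyclic reuse, assumed)."""
    if num_unique <= 0 or num_unique > num_layers:
        raise ValueError("num_unique must be in 1..num_layers")
    return [layer % num_unique for layer in range(num_layers)]


def stored_block_params(num_unique: int, params_per_block: int) -> int:
    """Only unique blocks are stored, so size scales with num_unique."""
    return num_unique * params_per_block


if __name__ == "__main__":
    # Six layers of compute backed by two stored blocks.
    print(tied_layer_map(6, 2))
    print(stored_block_params(2, 1_000_000))
```

With this kind of sharing, the stored weight count depends on the number of unique blocks rather than on depth, although each extra layer still costs compute time per step.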
LR Schedule
warmdown
parameters: {"warmdown_iters":null,"constraint":"must fit within actual step count"}
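The constraint can be illustrated with a minimal sketch of a linear warmdown multiplier. This is an assumed schedule shape, not the PR's actual code; `lr_scale` and `fit_warmdown` are hypothetical helpers, and the step counts are made up for illustration.

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int) -> float:
    """Learning-rate multiplier: flat at 1.0, then linear decay to 0 (assumed shape)."""
    if warmdown_iters <= 0:
        return 1.0
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)


def fit_warmdown(warmdown_iters: int, actual_steps: int) -> int:
    """Clamp the warmdown length so it never exceeds the steps actually run."""
    return max(0, min(warmdown_iters, actual_steps))


if __name__ == "__main__":
    # Warmdown longer than the run: the multiplier starts below 1.0 at step 0.
    print(lr_scale(0, total_steps=1000, warmdown_iters=3000))
    # Warmdown that fits: full learning rate first, decay over the last 200 steps.
    print(lr_scale(0, total_steps=1000, warmdown_iters=200))
    print(lr_scale(900, total_steps=1000, warmdown_iters=200))
```

In this sketch, a warmdown longer than the run means training never reaches the peak learning rate, which is one plausible reason the setting has to fit within the actual step count.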
Novel Contributions
- Found that MODEL_DIM=256 with MLP_MULT=3 and the warmdown fix achieved the best reported score, 1.4702 bpb.
- Observed that wider models scored worse under the time budget, and that deeper models without tying performed poorly.
- Identified that step speed matters more than model size under the available time budget.
- Noted that WARMDOWN_ITERS must fit within the actual step count.
- Suggested using the remaining artifact headroom for potential layer tying.