PR #408

open

First submission

by markste-in
val_bpb: 1.4784
Tags: Architecture · Transformer · Optimizer · Artifact Size

Training Techniques

Architecture
MLP4
Increased MLP multiplier from 2 to 4, expanding model capacity.
parameters: {"mlp_mult":4}
RoPE
Raised RoPE base from 10,000 to 100,000.
parameters: {"rope_base":100000}
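The two architecture changes above can be sketched in pure Python (a minimal sketch; the `d_model` and `head_dim` values used in the comments are hypothetical, not stated in the PR):

```python
def mlp_params(d_model: int, mlp_mult: int) -> int:
    """Parameter count of a two-layer MLP block: d_model -> hidden -> d_model."""
    hidden = mlp_mult * d_model
    return (d_model * hidden + hidden) + (hidden * d_model + d_model)

def rope_freqs(head_dim: int, rope_base: float) -> list[float]:
    """Per-pair RoPE rotation frequencies, rope_base ** (-2i / head_dim).
    A larger base makes the slowest-rotating pairs rotate even more slowly."""
    return [rope_base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Raising mlp_mult from 2 to 4 roughly doubles the MLP parameter count,
# e.g. mlp_params(768, 4) vs. mlp_params(768, 2).
# Raising rope_base from 10_000 to 100_000 lowers the minimum frequency,
# e.g. rope_freqs(64, 100_000)[-1] < rope_freqs(64, 10_000)[-1].
```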
LR Schedule
warmdown
Shortened warmdown from 800 to 600 iterations.
parameters: {"warmdown_iters":600}
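A minimal sketch of a warmdown schedule, assuming a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps (the `total_steps` value in the comments is hypothetical):

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int = 600) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0  # constant phase
    # linear warmdown to zero over the final warmdown_iters steps,
    # e.g. with total_steps=5000: scale is 1.0 at step 4400, 0.0 at step 5000
    return max(0.0, (total_steps - step) / warmdown_iters)
```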
Other
other
Reduced training batch tokens from 196,608 to 98,304 to improve optimization efficiency and fit track limits.
parameters: {"train_batch_tokens":98304}
other
Lowered matrix and scalar learning rates from 0.04 to 0.035.
parameters: {"matrix_lr":0.035,"scalar_lr":0.035}
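Taken together, the submission's deltas against the baseline can be summarized as a config overlay (key names follow the `parameters` blocks above; baseline values are the "from" values in the descriptions):

```python
BASELINE = {
    "mlp_mult": 2,
    "rope_base": 10_000,
    "warmdown_iters": 800,
    "train_batch_tokens": 196_608,
    "matrix_lr": 0.04,
    "scalar_lr": 0.04,
}

# Overlay the PR's parameter changes on the baseline.
SUBMISSION = {
    **BASELINE,
    "mlp_mult": 4,
    "rope_base": 100_000,
    "warmdown_iters": 600,
    "train_batch_tokens": 98_304,
    "matrix_lr": 0.035,
    "scalar_lr": 0.035,
}

# Every baseline key is overridden; note that at a fixed token budget,
# halving train_batch_tokens doubles the number of optimizer steps.
changed = {k for k in BASELINE if SUBMISSION[k] != BASELINE[k]}
```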

Novel Contributions

  • Reduced training batch tokens to 98,304
  • Increased MLP multiplier from 2 to 4
  • Lowered matrix and scalar learning rates to 0.035
  • Shortened warmdown from 800 to 600 iterations
  • Raised RoPE base from 10,000 to 100,000
  • Achieved 1.4784 val_bpb on a small GPU within 10 minutes