val_bpb: 1.4784
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
MLP4
Increased the MLP hidden-width multiplier from 2 to 4, expanding model capacity.
parameters: {"mlp_mult":4}
RoPE
Raised the RoPE base from 10,000 to 100,000, stretching the longest rotary wavelengths.
parameters: {"rope_base":100000}
LR Schedule
warmdown
Shortened the learning-rate warmdown from 800 to 600 iterations.
parameters: {"warmdown_iters":600}
Other
Halved training batch tokens from 196,608 to 98,304 to improve optimization efficiency and stay within the track's limits.
parameters: {"train_batch_tokens":98304}
Lowered the matrix and scalar learning rates from 0.04 to 0.035.
parameters: {"matrix_lr":0.035,"scalar_lr":0.035}
Novel Contributions
- Reduced training batch tokens to 98,304
- Increased MLP multiplier from 2 to 4
- Lowered matrix and scalar learning rates to 0.035
- Shortened warmdown from 800 to 600 iterations
- Raised RoPE base from 10,000 to 100,000
- Achieved 1.4784 val_bpb on a small GPU within 10 minutes