| Field | Value |
|---|---|
| val_bpb | 1.2168 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15.8 MB |
Training Techniques
Architecture: MLP3x
Uses a 9-layer model with a 2x MLP multiplier as part of the architecture tuning.
Parameters: `{"layers": 9, "mlp_multiplier": 2}`
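
A minimal sketch of what a 2x MLP multiplier means in a standard Transformer feed-forward block, where the hidden width is the model width times the multiplier. The `d_model` value and GELU activation below are hypothetical placeholders, not values from this record.

```python
import torch.nn as nn

class MLP(nn.Module):
    """Feed-forward block whose hidden width is d_model * mlp_multiplier."""
    def __init__(self, d_model: int, mlp_multiplier: int = 2):
        super().__init__()
        hidden = d_model * mlp_multiplier  # 2x multiplier per the parameters above
        self.fc_in = nn.Linear(d_model, hidden)
        self.act = nn.GELU()               # activation choice is an assumption
        self.fc_out = nn.Linear(hidden, d_model)

    def forward(self, x):
        return self.fc_out(self.act(self.fc_in(x)))

# Per {"layers": 9}, nine such blocks would be stacked; d_model=768 is a placeholder.
blocks = nn.ModuleList([MLP(d_model=768, mlp_multiplier=2) for _ in range(9)])
```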
Sequence Length: sequence_length
Parameters: `{"train_length": 2048, "eval_length": null}`
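
One plausible reading of `eval_length: null` is that evaluation falls back to the training length; the helper below sketches that assumption and is not part of the recorded configuration.

```python
def resolve_lengths(train_length: int = 2048, eval_length: int | None = None):
    """Return (train_length, eval_length), reusing train_length when eval_length is unset."""
    return train_length, eval_length if eval_length is not None else train_length

train_len, eval_len = resolve_lengths(2048, None)  # -> (2048, 2048)
```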
LR Schedule: warmdown
Parameters: `{"warmdown_steps": 3600}`
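
A minimal sketch of a warmdown schedule, assuming the learning rate is held constant and then decayed linearly to zero over the final `warmdown_steps` iterations; `total_steps` and `base_lr` below are hypothetical placeholders.

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float, warmdown_steps: int = 3600) -> float:
    """Constant LR, then linear decay to zero over the last `warmdown_steps` steps."""
    steps_remaining = total_steps - step
    if steps_remaining >= warmdown_steps:
        return base_lr
    return base_lr * max(steps_remaining, 0) / warmdown_steps

# Example: with total_steps=20000, the LR stays at base_lr until step 16400,
# then ramps linearly to 0 by the final step.
```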
Novel Contributions
- Maintained a 9-layer, 2x MLP architecture
- Increased training sequence length to 2048
- Tuned the matrix learning rate to 0.055 (see the sketch after this list)
- Extended warmdown iterations to 3600
- Achieved a reported validation BPB of 1.2168 at a 15.8 MB artifact size
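
A hedged sketch of how a separate matrix learning rate of 0.055 could be wired up, assuming "matrix" refers to 2-D (and higher) weight tensors placed in their own parameter group. The optimizer (AdamW here) and the non-matrix learning rate are placeholders, since the record above does not list the optimizer.

```python
from torch import nn, optim

def build_optimizer(model: nn.Module, matrix_lr: float = 0.055, other_lr: float = 3e-4):
    """Apply the tuned matrix LR to weight matrices and a separate LR to everything else."""
    matrix_params = [p for p in model.parameters() if p.ndim >= 2]
    other_params = [p for p in model.parameters() if p.ndim < 2]
    # AdamW and other_lr are placeholders; only matrix_lr=0.055 comes from the record.
    return optim.AdamW([
        {"params": matrix_params, "lr": matrix_lr},
        {"params": other_params, "lr": other_lr},
    ])
```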