PR #830
Status: open (non-record)
LeakyMixer: 11L leaky_relu(0.5)^2 + backoff n-gram mixer
by zlxi02
val_bpb: 1.4096
Architecture: Transformer
Optimizer: —
Artifact Size: 13.49 MB
Training Techniques
Architecture
Transformer depth
Increased model depth from 9 to 11 layers.
parameters: {"layers":11}
MLP activation
Swapped relu^2 for leaky_relu(0.5)^2 in the MLP.
parameters: {"negative_slope":0.5}
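The activation swap can be sketched directly from the stated parameters: a leaky ReLU with negative_slope 0.5, then squared. This is a minimal reading of "leaky_relu(0.5)^2" (the PR does not show the actual MLP code, so the exact composition is an assumption):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # leaky_relu followed by squaring, per {"negative_slope": 0.5}.
    # Note: squaring makes the output non-negative on both branches;
    # the negative branch contributes negative_slope^2 * x^2.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

Compared with the previous relu^2, the negative branch now passes a damped (0.25x^2) signal instead of zero, so gradients no longer vanish for negative pre-activations.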
Weight tying
Uses tied token embeddings.
parameters: null
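Tied token embeddings mean one matrix serves both as the input embedding (row lookup) and the output projection (dot product against every row), which shrinks the artifact. A dependency-free sketch of the idea (the class and method names here are illustrative, not from the PR):

```python
class TiedEmbedding:
    # One vocab x dim matrix shared between input embedding and output head.
    def __init__(self, vocab_size, dim):
        self.w = [[0.0] * dim for _ in range(vocab_size)]

    def embed(self, token):
        # Input side: row lookup.
        return self.w[token]

    def logits(self, hidden):
        # Output side: score each vocab row against the hidden state.
        return [sum(a * b for a, b in zip(row, hidden)) for row in self.w]
```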
Compression
zlib
parameters: {"level":null}
Evaluation
int8+zlib roundtrip eval
parameters: null
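The roundtrip eval quantizes weights to int8, zlib-compresses the bytes (this is what the Artifact Size measures), then decompresses and dequantizes before scoring. A sketch with a simple symmetric quantization scheme (the PR does not specify its scheme, so the scaling below is an assumption):

```python
import zlib

def int8_zlib_roundtrip(weights):
    # Symmetric per-tensor int8 quantization (hypothetical scheme).
    peak = max(abs(w) for w in weights)
    scale = peak / 127.0 if peak > 0 else 1.0
    # Two's-complement bytes; values stay in [-127, 127] by construction.
    q = bytes(round(w / scale) & 0xFF for w in weights)
    blob = zlib.compress(q, 9)
    # Roundtrip: decompress, restore sign, rescale.
    deq = [((b - 256 if b > 127 else b) * scale) for b in zlib.decompress(blob)]
    return deq, len(blob)
```

Evaluating on the dequantized weights ensures val_bpb reflects exactly the model that the compressed artifact can reproduce.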
Test-Time Training
score-first TTT
parameters: {"backoff_orders":[1,2,3,4,5,6,7],"entropy_adaptive_alpha":true,"implemented_in_c":true}
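The entropy_adaptive_alpha flag suggests the mixing weight between neural and n-gram predictions is derived from the n-gram distribution's entropy: when the n-gram model is confident (low entropy), it gets more weight. The PR does not show the rule, so the linear mapping below is one plausible sketch:

```python
import math

def entropy(p):
    # Shannon entropy in nats of a probability vector.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def mix_logprobs(neural_p, ngram_p, alpha_max=0.5):
    # Hypothetical entropy-adaptive alpha: confident (low-entropy) n-gram
    # distributions pull alpha toward alpha_max; uniform ones toward 0.
    h = entropy(ngram_p)
    h_max = math.log(len(ngram_p))
    alpha = alpha_max * (1.0 - h / h_max) if h_max > 0 else 0.0
    return [(1.0 - alpha) * n + alpha * g for n, g in zip(neural_p, ngram_p)]
```

"Score-first" here means the validation tokens are scored before being fed to the n-gram counts, so no token contributes to its own prediction.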
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
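A warmdown schedule holds the learning rate flat and then decays it linearly over the final warmdown_steps (3500 here). A sketch under that common reading (base_lr and total_steps are illustrative; the PR only pins warmdown_steps):

```python
def lr_at(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    # Constant LR until the warmdown window, then linear decay to zero.
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```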
Novel Contributions
- Swapped relu^2 for leaky_relu(0.5)^2
- Increased model depth from 9 to 11 layers
- Extended warmdown schedule to 3500 steps
- Added a backoff n-gram mixer that runs at eval time
- Built a token cache while scoring the validation set
- Mixed neural logits with n-gram predictions using entropy-adaptive alpha
- Implemented the n-gram mixer in C for speed
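The backoff mixer above is implemented in C for speed; its logic can be sketched in Python. With backoff_orders [1..7], a natural reading is: predict from the longest context that has counts, falling back to shorter ones (the exact backoff rule is not shown in the PR, so this "longest-match" sketch is an assumption):

```python
from collections import defaultdict

class BackoffNgram:
    # Counts for context lengths 1..7; predict() tries the longest first.
    def __init__(self, orders=(1, 2, 3, 4, 5, 6, 7)):
        self.orders = sorted(orders, reverse=True)
        self.counts = {k: defaultdict(lambda: defaultdict(int))
                       for k in self.orders}

    def update(self, tokens):
        # Incremental counting; this is the "token cache" role during scoring.
        for k in self.orders:
            for i in range(len(tokens) - k):
                ctx = tuple(tokens[i:i + k])
                self.counts[k][ctx][tokens[i + k]] += 1

    def predict(self, context):
        # Back off from the longest context with any counts.
        for k in self.orders:
            if len(context) < k:
                continue
            ctx = tuple(context[-k:])
            nxt = self.counts[k].get(ctx)
            if nxt:
                total = sum(nxt.values())
                return {t: c / total for t, c in nxt.items()}
        return {}
```

In the C implementation the hash-map lookups would be the hot path; running all seven orders per token is cheap relative to a transformer forward pass.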