PR #830 (open)

Non-record: LeakyMixer: 11L leaky_relu(0.5)^2 + backoff n-gram mixer

val_bpb: 1.4096
Architecture: Transformer
Optimizer:
Artifact Size: 13.49 MB

Training Techniques

Architecture
  • Transformer depth: increased model depth from 9 to 11 layers. parameters: {"layers":11}
  • MLP activation: swapped relu^2 for leaky_relu(0.5)^2 in the MLP (see the sketch after this group). parameters: {"negative_slope":0.5}
  • Weight tying: uses tied token embeddings. parameters: null
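A minimal PyTorch sketch of the two MLP-facing changes, assuming a standard two-layer MLP block; the class name SquaredLeakyReLU, the layer names, and the dimensions are placeholders, not taken from the PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SquaredLeakyReLU(nn.Module):
    """leaky_relu(x, 0.5) squared, replacing the usual relu(x)^2 activation."""
    def __init__(self, negative_slope: float = 0.5):
        super().__init__()
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.leaky_relu(x, self.negative_slope) ** 2

class MLP(nn.Module):
    """Placeholder widths; the PR does not state d_model or the hidden size."""
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_hidden)
        self.act = SquaredLeakyReLU(negative_slope=0.5)
        self.fc_out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(self.act(self.fc_in(x)))

# Weight tying: the output head shares the token-embedding matrix.
vocab_size, d_model = 50304, 512   # placeholder sizes
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embedding.weight  # tied parameters
```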
Compression
  • zlib. level: null
Evaluation
  • int8+zlib roundtrip eval. parameters: null
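A rough sketch of what the int8+zlib roundtrip could look like, presumably also how the 13.49 MB artifact size is measured: quantize each tensor to int8, zlib-compress the bytes, then decompress and dequantize before scoring val_bpb. The per-tensor symmetric quantization scheme and the function name are assumptions; the zlib level is left at its default since the PR lists it as null.

```python
import zlib
import numpy as np

def int8_zlib_roundtrip(weights: np.ndarray, level: int = -1):
    """Quantize to int8, compress with zlib, then invert both steps.

    Per-tensor symmetric quantization is an assumption; the PR only says
    "int8+zlib roundtrip eval" and leaves the zlib level unspecified.
    """
    scale = float(np.abs(weights).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    blob = zlib.compress(q.tobytes(), level)             # compressed artifact bytes
    restored_q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    restored = restored_q.reshape(weights.shape).astype(np.float32) * scale
    return restored, len(blob)

# Summing len(blob) over all tensors would give the reported artifact size;
# the restored weights are what val_bpb is measured with.
```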
Test-Time Training
  • score-first TTT. parameters: {"backoff_orders":[1,2,3,4,5,6,7],"entropy_adaptive_alpha":true,"implemented_in_c":true}
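The PR's mixer is written in C; below is a Python sketch of one reading of "score-first": a backoff n-gram cache over orders 1-7 where each validation token is scored before it is added to the counts, so no token informs its own prediction. The class and function names, the backoff rule, and the add-one smoothing are assumptions.

```python
import math
from collections import defaultdict

class BackoffNgram:
    """Online backoff n-gram model updated while it scores ("score-first").

    Orders 1..7 follow the PR's backoff_orders. The backoff rule (longest
    context with any counts wins) and the smoothing are assumptions.
    """
    def __init__(self, orders=(1, 2, 3, 4, 5, 6, 7), vocab_size=50304):
        self.orders = sorted(orders, reverse=True)   # try longest context first
        self.vocab_size = vocab_size
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def _ctx(self, context, n):
        return tuple(context[-(n - 1):]) if n > 1 else ()

    def prob(self, context, token):
        for n in self.orders:
            bucket = self.counts[n].get(self._ctx(context, n))
            if bucket:
                total = sum(bucket.values())
                return (bucket.get(token, 0) + 1) / (total + self.vocab_size)
        return 1.0 / self.vocab_size                 # nothing cached yet

    def update(self, context, token):
        for n in self.orders:
            self.counts[n][self._ctx(context, n)][token] += 1

def score_first(tokens, ngram):
    """Score each token before caching it, so a token never predicts itself."""
    bits = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - 6):i]            # up to 6 preceding tokens for order 7
        bits += -math.log2(ngram.prob(context, tok))
        ngram.update(context, tok)
    return bits / len(tokens)
```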
LR Schedule
  • warmdown. parameters: {"warmdown_steps":3500}
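A sketch of one plausible reading of the warmdown schedule, assuming the learning rate is held at its peak and then decayed linearly to zero over the final 3,500 steps; the decay shape, total_steps, and peak_lr are assumptions, only warmdown_steps=3500 comes from the PR.

```python
def warmdown_lr(step: int, total_steps: int, peak_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Hold peak_lr, then decay linearly to 0 over the final warmdown_steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return peak_lr
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / warmdown_steps
```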

Novel Contributions

  • Swapped relu^2 for leaky_relu(0.5)^2
  • Increased model depth from 9 to 11 layers
  • Extended warmdown schedule to 3500 steps
  • Added a backoff n-gram mixer that runs at eval time
  • Built a token cache while scoring the validation set
  • Mixed neural logits with n-gram predictions using an entropy-adaptive alpha (see the sketch after this list)
  • Implemented the n-gram mixer in C for speed
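A guess at what the entropy-adaptive mixing could look like in Python: the blend weight alpha scales with the entropy of the neural next-token distribution, so the n-gram predictions carry more weight where the transformer is unsure. The mapping from entropy to alpha and the max_alpha cap are illustrative assumptions; the PR's C implementation may differ.

```python
import numpy as np

def mix_with_ngram(neural_logits: np.ndarray, ngram_probs: np.ndarray,
                   max_alpha: float = 0.5) -> np.ndarray:
    """Blend the neural next-token distribution with the n-gram distribution.

    alpha grows with the normalized entropy of the neural distribution, so the
    mixer mostly intervenes where the neural model is uncertain.
    """
    logits = neural_logits - neural_logits.max()
    p_neural = np.exp(logits)
    p_neural /= p_neural.sum()

    entropy = -(p_neural * np.log(p_neural + 1e-12)).sum()
    alpha = max_alpha * entropy / np.log(len(p_neural))   # entropy-adaptive weight

    mixed = (1.0 - alpha) * p_neural + alpha * ngram_probs
    return mixed / mixed.sum()
```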