PR #806

open

Record: Backoff N-gram Cache + LeakyReLU(0.9)² (val_bpb=0.6678)

by ibarrajo
val_bpb: 0.6678
Architecture: Transformer
Optimizer: Muon
Artifact Size: 8.6 MB

Training Techniques

Architecture
SmearGate
Added SmearGate to the Transformer architecture.
parameters: null
BigramHash
Added a BigramHash component with a vocabulary size of 2048.
parameters: {"vocab_size":2048}
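The PR does not spell out the BigramHash scheme beyond its vocabulary size, so the following is a hedged sketch: adjacent token ids are mixed with a simple multiplicative hash into a table of 2048 buckets (the hash constant and the bucket-0 convention for position 0 are assumptions, not the PR's implementation).

```python
# Hypothetical sketch of a BigramHash feature with vocab_size=2048.
BIGRAM_VOCAB = 2048

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Map a (prev, cur) token-id pair to one of BIGRAM_VOCAB buckets."""
    # Knuth-style multiplicative mixing; any decent integer hash works here.
    h = (prev_id * 0x9E3779B1 + cur_id) & 0xFFFFFFFF
    return h % BIGRAM_VOCAB

def bigram_buckets(token_ids):
    """Bucket index per position; position 0 has no predecessor, so use bucket 0."""
    out = [0]
    for prev, cur in zip(token_ids, token_ids[1:]):
        out.append(bigram_bucket(prev, cur))
    return out
```

In a model these bucket indices would typically index a small embedding table whose output is added to the token embedding stream.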
MLP3x
Uses a 3x MLP expansion in the Transformer.
parameters: null
tied embeddings
Uses tied input/output embeddings (weight tying).
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Evaluation
sliding window eval
parameters: {"stride":64}
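Sliding-window eval with stride 64 presumably follows the standard strided-perplexity scheme: overlapping windows advance by the stride, and only the newly covered tokens count toward the loss, so every token is scored exactly once with extra left context. A minimal index-generating sketch (the window length is a free parameter not stated in this PR):

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    """Yield (begin, end, n_scored) spans for sliding-window evaluation.

    Each window covers tokens [begin, end); only the n_scored tokens not
    covered by the previous window contribute to the loss, so every token
    is scored once with up to (window - stride) tokens of left context.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, end - prev_end  # only newly covered tokens count
        prev_end = end
        if end == n_tokens:
            break
```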
multi-order backoff n-gram cache
parameters: {"orders":[2,3,4,5,6,7],"entropy_adaptive_alpha":true,"score_first":true,"min_count":2}
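The backoff cache can be sketched as follows, using the PR's stated orders (2-7) and min_count=2: counts are kept per context, and prediction backs off to the longest context seen at least min_count times. The smoothing and fallback details here are assumptions, not the PR's exact implementation.

```python
from collections import defaultdict

class BackoffNgramCache:
    """Minimal sketch of a multi-order backoff n-gram cache (orders 2-7)."""

    def __init__(self, orders=(2, 3, 4, 5, 6, 7), min_count=2):
        self.orders = sorted(orders, reverse=True)  # try longest context first
        self.min_count = min_count
        # context tuple -> {next_token: count}
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def update(self, history, token):
        """Record `token` under every order's trailing context of `history`."""
        for n in self.orders:
            ctx = tuple(history[-(n - 1):])
            if len(ctx) == n - 1:  # history long enough for this order
                self.counts[ctx][token] += 1
                self.totals[ctx] += 1

    def probs(self, history):
        """Distribution from the longest context with >= min_count observations;
        returns {} when no context qualifies (caller uses the model alone)."""
        for n in self.orders:
            ctx = tuple(history[-(n - 1):])
            if len(ctx) == n - 1 and self.totals[ctx] >= self.min_count:
                t = self.totals[ctx]
                return {w: c / t for w, c in self.counts[ctx].items()}
        return {}
```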
distributed cache pre-fill
parameters: {"multi_gpu":true,"rank":7,"prefill_tokens":54000000,"prefill_time_seconds":68}
Other
other
LeakyReLU(0.9)^2 activation replacing relu^2.
parameters: {"slope":0.9}
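The activation itself is simple to state: square the output of a LeakyReLU with negative slope 0.9, so negative inputs still carry signal (scaled by 0.81 after squaring) where relu^2 would be exactly flat. A scalar sketch:

```python
def leaky_relu_sq(x: float, slope: float = 0.9) -> float:
    """LeakyReLU(0.9)^2, the drop-in replacement for relu^2:
    x**2 for x >= 0, (slope * x)**2 for x < 0."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

In the model this is applied elementwise; note that with slope 0.9 the function is nearly even, so almost all sign information is discarded, as with relu^2's zero branch.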
other
Entropy-adaptive alpha mixing between model softmax and n-gram cache probabilities.
parameters: {"alpha_formula":"0.05 + 0.55 * sigmoid(2.0 * (H - 4.0))"}
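The alpha formula is given explicitly above; a sketch of the mixing step follows. Two details are assumptions here: that H is the entropy of the model's softmax (computed in nats below; the PR does not state the units) and that alpha weights the cache side of the blend.

```python
import math

def mix_alpha(H: float) -> float:
    """PR formula: alpha = 0.05 + 0.55 * sigmoid(2.0 * (H - 4.0)).
    Ranges over (0.05, 0.60): uncertain (high-entropy) model predictions
    lean harder on the n-gram cache."""
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (H - 4.0)))

def mix(p_model, p_cache):
    """Blend model and cache distributions with the entropy-adaptive alpha."""
    H = -sum(p * math.log(p) for p in p_model if p > 0.0)  # model entropy (nats assumed)
    a = mix_alpha(H)
    return [(1.0 - a) * pm + a * pc for pm, pc in zip(p_model, p_cache)]
```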

Novel Contributions

  • Multi-order backoff n-gram eval cache with orders 2-7
  • Entropy-adaptive alpha mixing between neural predictions and n-gram cache probabilities
  • Distributed cache pre-fill for multi-GPU coherence
  • LeakyReLU(0.9)^2 activation replacing relu^2
  • Score-first legality: scoring every token under inference_mode before cache update
  • Removal of illegal pre-eval test-time training
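The score-first legality bullet above can be sketched as an eval loop: each token is scored using only cache state built from strictly earlier tokens, and the cache ingests the token afterwards, so the cache never sees the token it is scoring. In the PR this runs under torch.inference_mode(); here a plain bigram count cache and a fixed alpha stand in for the real multi-order cache and the entropy-adaptive mixing, and model_prob is an assumed interface.

```python
import math
from collections import defaultdict

def eval_bpb_with_cache(tokens, model_prob, alpha=0.3):
    """Score-first evaluation: model_prob(history, token) -> neural probability.
    Returns bits per token over the sequence."""
    counts = defaultdict(lambda: defaultdict(int))  # prev token -> next-token counts
    totals = defaultdict(int)
    nats = 0.0
    for i, tok in enumerate(tokens):
        hist = tokens[:i]
        p = model_prob(hist, tok)
        if hist and totals[hist[-1]] > 0:           # cache has seen this context
            p_cache = counts[hist[-1]][tok] / totals[hist[-1]]
            p = (1.0 - alpha) * p + alpha * p_cache
        nats += -math.log(max(p, 1e-12))            # score BEFORE the update
        if hist:                                    # only now may the cache see (prev, tok)
            counts[hist[-1]][tok] += 1
            totals[hist[-1]] += 1
    return nats / (len(tokens) * math.log(2))
```

On a repetitive byte stream the cache term lifts the probability of repeated bigrams, so the mixed score beats the model alone while remaining legal (no token ever informs its own score).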