PR #1310

open

Non-record: Exact Sequence Matching + TTT on PR #549 (1.1177 BPB)

by cadenmcmann
val_bpb
1.1177
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,882,529 bytes

Training Techniques

Evaluation
sliding window eval
parameters: {"stride":64}
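A minimal sketch of how a stride-64 sliding-window evaluation can cover a token stream: each window supplies up to `window` tokens of context, but only the tokens not scored by the previous window contribute to the loss. The window size of 256 is an assumed placeholder; only the stride of 64 comes from the PR.

```python
def sliding_windows(n_tokens, window=256, stride=64):
    """Yield (start, end, n_scored) spans so every token is scored exactly once.

    `stride=64` is from the PR; `window=256` is an assumption standing in
    for the model's real context length.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)   # context begins here
        end = min(pos + stride, n_tokens)       # last token scored this window
        spans.append((start, end, end - pos))   # only the final (end - pos) tokens are scored
        pos = end
    return spans
```

Summing the scored counts over all spans recovers the full token count, so BPB is computed over every token exactly once.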
exact sequence matching
parameters: {"min_order":8,"max_order":12,"lambda":0.15,"blend_cap":0.5}
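A sketch of the exact sequence matching idea: cache next-token counts for every exact context of order 8 to 12 seen in the evaluation stream, then blend the longest matching context's empirical distribution into the model's output. The blend rule `w = min(lambda * match_count, blend_cap)` is an assumption; the PR states the parameters but not the exact formula.

```python
from collections import defaultdict

MIN_ORDER, MAX_ORDER = 8, 12   # context orders tried, from the PR
LAM, BLEND_CAP = 0.15, 0.5     # lambda and blend_cap, from the PR

class SequenceMatcher:
    """Caches exact n-gram contexts and blends their empirical next-token
    distribution into the model's. The weighting w = min(LAM * count,
    BLEND_CAP) is an assumed interpretation, not the PR's exact rule."""

    def __init__(self):
        self.cache = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Record next-token counts for every context of order 8..12.
        for order in range(MIN_ORDER, MAX_ORDER + 1):
            for i in range(len(tokens) - order):
                ctx = tuple(tokens[i:i + order])
                self.cache[ctx][tokens[i + order]] += 1

    def blend(self, tokens, model_probs):
        # Prefer the longest exactly-matching suffix of the current context.
        for order in range(MAX_ORDER, MIN_ORDER - 1, -1):
            ctx = tuple(tokens[-order:])
            if len(ctx) == order and ctx in self.cache:
                counts = self.cache[ctx]
                total = sum(counts.values())
                w = min(LAM * total, BLEND_CAP)
                return {t: (1 - w) * model_probs.get(t, 0.0)
                           + w * counts.get(t, 0) / total
                        for t in set(model_probs) | set(counts)}
        return model_probs  # no exact match: fall back to the model alone
```

Because the cached distribution is only mixed in (never substituted outright, thanks to the cap), the model's predictions still dominate when match evidence is thin.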
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"all_blocks_unfrozen":true}
Architecture
LeakyReLU
MLP activation uses LeakyReLU(0.5)^2
parameters: {"negative_slope":0.5}
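Reading the activation description literally, the MLP nonlinearity is a LeakyReLU with negative slope 0.5 followed by squaring. Whether the sign is re-applied after squaring is not stated in the PR, so this sketch takes the plain square:

```python
NEG_SLOPE = 0.5  # negative_slope, from the PR

def leaky_relu_sq(x):
    """LeakyReLU(0.5) then square, read literally from the PR description.

    Note the square makes the negative branch positive (0.25 * x**2);
    a sign-preserving variant is possible but not stated, so this is
    the unadorned interpretation.
    """
    y = x if x >= 0 else NEG_SLOPE * x
    return y * y
```

This is in the family of squared-ReLU activations; the leaky slope keeps a nonzero gradient on the negative side.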
GQA
Uses grouped query attention with fewer KV heads than query heads
parameters: {"heads":8,"kv_heads":4}
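A sketch of grouped query attention with the PR's head counts: 8 query heads share 4 KV heads, so each pair of query heads attends over the same keys and values. Shapes, the absence of a causal mask, and the loop form are simplifications for clarity.

```python
import numpy as np

N_HEADS, N_KV_HEADS = 8, 4        # heads and kv_heads, from the PR
GROUP = N_HEADS // N_KV_HEADS     # query heads sharing one KV head (= 2)

def gqa(q, k, v):
    """GQA sketch: q is (N_HEADS, T, d); k and v are (N_KV_HEADS, T, d).

    Query head h uses KV head h // GROUP, halving the KV cache relative
    to full multi-head attention. Causal masking is omitted here.
    """
    T, d = q.shape[1], q.shape[2]
    out = np.empty_like(q)
    for h in range(N_HEADS):
        kv = h // GROUP                        # shared KV head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (T, T) attention logits
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)
        out[h] = attn @ v[kv]
    return out
```

The practical payoff is the 2x smaller KV cache; query-side capacity is unchanged.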
XSA
XSA attention on the last 4 layers
parameters: {"layers":4}
BigramHash
Bigram hash embedding / vocabulary component
parameters: {"size":1536}
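A sketch of a bigram hash embedding: each (previous, current) token pair is hashed into a 1536-row table whose rows would be added to the ordinary token embedding. Only the table size comes from the PR; the hash mixing constant and the handling of the first position are placeholder assumptions.

```python
TABLE_SIZE = 1536  # size, from the PR

def bigram_hash(prev_tok, tok):
    """Map a (prev, current) token pair to one of TABLE_SIZE embedding rows.

    The multiplicative constant is an arbitrary mixing prime (assumption);
    the PR does not specify the hash function.
    """
    return (prev_tok * 0x9E3779B1 + tok) % TABLE_SIZE

def bigram_ids(tokens):
    # First position has no predecessor; reuse the token itself (assumption).
    return [bigram_hash(tokens[i - 1] if i else tokens[i], tokens[i])
            for i in range(len(tokens))]
```

Hashing keeps the table small (1536 rows regardless of vocabulary size) at the cost of collisions between unrelated bigrams.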
Compression
lzma
level: null
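The artifact size is presumably measured on the lzma-compressed checkpoint bytes; the compression level is listed as null, so the sketch below uses the stdlib default preset.

```python
import lzma

def artifact_size(blob: bytes) -> int:
    """Size metric sketch: lzma-compress the serialized artifact and count
    bytes. The preset is left at the stdlib default since the PR lists the
    level as null."""
    return len(lzma.compress(blob))
```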
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true}
Weight Averaging
EMA
parameters: null
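A minimal sketch of EMA weight averaging: maintain a shadow copy of the weights updated as a decayed average after each step, and evaluate with the shadow copy. The decay value below is an assumption, since the PR lists no EMA parameters.

```python
def ema_update(avg, params, decay=0.999):
    """One EMA step over a flat list of weights.

    decay=0.999 is a common default and an assumption here; the PR's EMA
    parameters are unspecified (null).
    """
    return [decay * a + (1 - decay) * p for a, p in zip(avg, params)]
```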
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • Exact eval-time N-gram sequence matching stacked on top of TTT
  • Demonstration that sequence matching and TTT are complementary
  • Improvement from 1.1195 BPB (TTT alone) to 1.1177 BPB with sequence matching added
  • Use of exact 8-12 token context caching to mix cached next-token predictions into model outputs