PR #1507 (open)

[RECORD] L-BFGS SLOT + Entropy-Adaptive N-gram Mixer (0.2282 BPB)

by ChideraIbe123
val_bpb: 0.2282
Architecture: Transformer
Optimizer: L-BFGS
Artifact Size: ~15.75 MB

Training Techniques

Optimizer: L-BFGS
  • weight_decay: null
  • momentum: null
  • other_params: {"history_size": 10, "line_search": "strong Wolfe", "steps": 6}
Architecture
  • LeakyReLU: LeakyReLU-squared MLP activation; parameters: {"negative_slope": 0.5}
  • GQA: grouped-query attention with fewer KV heads than attention heads; parameters: {"heads": 8, "kv_heads": 4}
  • SmearGate: SmearGate component in the architecture
  • BigramHash: bigram hash feature/module
  • XSA: XSA-all attention component
  • Weight Averaging: EMA + SWA
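The GQA setting (8 query heads sharing 4 KV heads) can be illustrated with a dependency-free single-position sketch: query head h reads KV head h // (heads // kv_heads). This is a generic GQA illustration under the PR's head counts, not the submission's actual module.

```python
import math

def gqa_attention(q, k, v, kv_heads):
    """Grouped-query attention sketch for one query position.
    q: [heads][d], k and v: [kv_heads][T][d]. Each group of
    heads // kv_heads query heads shares one KV head."""
    heads, d = len(q), len(q[0])
    group = heads // kv_heads
    out = []
    for h in range(heads):
        kh = h // group  # which shared KV head this query head uses
        scores = [sum(qi * ki for qi, ki in zip(q[h], kt)) / math.sqrt(d)
                  for kt in k[kh]]
        m = max(scores)  # numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * vt[j] for wi, vt in zip(w, v[kh])) / z
                    for j in range(d)])
    return out
```

With 4 KV heads instead of 8, the KV cache is halved while each query head still gets its own attention pattern.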
Quantization: GPTQ
  • bits: 6
  • scope: all

Compression: lzma
  • level: null
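The artifact-size side of the recipe is int6 quantization followed by lzma. Real GPTQ compensates rounding error column by column using second-order statistics; the sketch below substitutes plain round-to-nearest so it stays dependency-free, and only demonstrates the quantize, serialize, and compress pipeline. All names and the one-code-per-byte layout are illustrative.

```python
import lzma
import random
import struct

def quantize_int6(weights):
    """Per-tensor symmetric round-to-nearest into the signed 6-bit
    range [-31, 31]. (GPTQ proper adds Hessian-based error
    compensation; this is only an RTN placeholder.)"""
    scale = max(abs(w) for w in weights) / 31 or 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return scale, q

def dequantize(scale, q):
    return [scale * qi for qi in q]

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(4096)]
scale, q = quantize_int6(w)
# Serialize: float32 scale, then one 6-bit code per byte (unpacked).
raw = struct.pack("<f", scale) + bytes(qi & 0x3F for qi in q)
blob = lzma.compress(raw, preset=9)
```

Because the 6-bit codes of Gaussian-ish weights are far from uniform, lzma removes a further chunk of the artifact size on top of quantization.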
Test-Time Training: score-first SLOT
  • parameters: {"frozen_model": true, "no_grad_hidden_states": true}
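Score-first SLOT with a frozen model and no-grad hidden states amounts to: cache the final hidden states once, then optimize a small per-sample parameter against the prompt's own next-token loss through the output head only. A hedged sketch of that idea is below; the PR pairs this with L-BFGS, but plain gradient descent is used here to keep it dependency-free, and the additive `delta` parameterization is an assumption, not the PR's actual formulation.

```python
import math

def slot_adapt(hidden, W, targets, steps=6, lr=0.1):
    """SLOT-style test-time sketch. `hidden` are frozen, pre-cached
    final hidden states [T][d]; W is the frozen output head [V][d].
    Only the additive vector `delta` is optimized, by minimizing
    cross-entropy of the prompt's own next tokens."""
    d, V = len(hidden[0]), len(W)
    delta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for h, t in zip(hidden, targets):
            z = [sum(W[v][j] * (h[j] + delta[j]) for j in range(d))
                 for v in range(V)]
            m = max(z)
            e = [math.exp(zi - m) for zi in z]
            Z = sum(e)
            p = [ei / Z for ei in e]
            # dL/ddelta = sum_v (p_v - 1[v == t]) * W_v
            for v in range(V):
                coef = p[v] - (1.0 if v == t else 0.0)
                for j in range(d):
                    grad[j] += coef * W[v][j]
        delta = [dj - lr * g / len(hidden) for dj, g in zip(delta, grad)]
    return delta
```

Because the backbone never sees a gradient, the per-sample cost is a handful of tiny head-only optimization steps.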
Other: entropy-adaptive n-gram mixer with target-independent mixing features (entropy, match order, context count)
  • parameters: {"order": 12, "hash_buckets": 4000000}
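A dependency-free sketch of what an order-12 hashed n-gram backoff with target-independent mixing could look like is below. The bucket count and maximum order come from the PR; the gating formula combining entropy, match order, and context count is purely illustrative (the real mixer presumably fits the gate from those features), and all names are assumptions.

```python
import math

NUM_BUCKETS = 4_000_000  # hashed contexts, as in the PR
MAX_ORDER = 12

def bucket(ctx):
    # Hash a context tuple of token ids into a fixed bucket table.
    return hash(ctx) % NUM_BUCKETS

class NgramMixer:
    """Hashed n-gram model with entropy-adaptive mixing. The mixing
    weight depends only on target-independent features (distribution
    entropy, match order, context count), so the gate never peeks at
    the token being predicted."""
    def __init__(self):
        self.counts = {}  # (bucket, next_token) -> count
        self.totals = {}  # bucket -> count

    def train(self, tokens):
        for i in range(1, len(tokens)):
            for n in range(1, min(MAX_ORDER, i) + 1):
                b = bucket(tuple(tokens[i - n:i]))
                key = (b, tokens[i])
                self.counts[key] = self.counts.get(key, 0) + 1
                self.totals[b] = self.totals.get(b, 0) + 1

    def predict(self, context, vocab):
        # Back off from the longest order with any observed continuation.
        for n in range(min(MAX_ORDER, len(context)), 0, -1):
            b = bucket(tuple(context[-n:]))
            total = self.totals.get(b, 0)
            if total:
                probs = {t: self.counts.get((b, t), 0) / total for t in vocab}
                ent = -sum(p * math.log(p) for p in probs.values() if p > 0)
                # Illustrative gate: trust the n-gram more when its
                # distribution is peaked, its order is high, and its
                # context is frequent.
                w = (1 / (1 + ent)) * (n / MAX_ORDER) * (total / (total + 1))
                return probs, w
        return {t: 1 / len(vocab) for t in vocab}, 0.0

def mix(p_model, p_ngram, w):
    # Interpolate neural-model and n-gram distributions with gate w.
    return {t: (1 - w) * p_model[t] + w * p_ngram.get(t, 0.0) for t in p_model}
```

On repetitive text the matched distribution is low-entropy, so the gate shifts mass toward the n-gram model exactly where it is most reliable.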

Novel Contributions

  • L-BFGS optimization for SLOT instead of AdamW
  • Entropy-adaptive n-gram mixer with target-independent features
  • Order-12 vectorized n-gram backoff with 4M hash buckets
  • Strong Wolfe line search with limited L-BFGS history
  • Combination of EMA, SWA, late QAT, GPTQ int6, and lzma compression
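The weight-averaging entry combines two standard schemes: an exponential moving average that tracks recent weights, and a stochastic weight average that uniformly averages checkpoints from the tail of training. A minimal sketch of both running side by side (the decay value is illustrative, not the PR's):

```python
class WeightAverager:
    """Maintains an EMA and an SWA of a weight vector simultaneously,
    as a sketch of the PR's EMA + SWA combination."""
    def __init__(self, decay=0.999):
        self.decay = decay
        self.ema = None            # exponential moving average
        self.swa, self.n = None, 0  # uniform running average

    def update(self, weights):
        if self.ema is None:
            self.ema = list(weights)
            self.swa = list(weights)
            self.n = 1
            return
        d = self.decay
        self.ema = [d * e + (1 - d) * w for e, w in zip(self.ema, weights)]
        self.n += 1
        # Incremental uniform mean: s += (w - s) / n
        self.swa = [s + (w - s) / self.n for s, w in zip(self.swa, weights)]
```

EMA reacts quickly to recent steps while SWA flattens the loss landscape over a longer window, which is why recipes often evaluate both and keep the better checkpoint.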