PR #798

open

Record: Order-Adaptive Entropy Gating + BackoffNgramMixer (val_bpb=0.5466)

by travispchen
val_bpb
0.5466
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
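The record lists only the bit width (6) and scope (all weights). A minimal sketch of symmetric per-tensor quantization to a signed 6-bit grid; the rounding and clipping choices are assumptions:

```python
import numpy as np

def quantize_int6(w, bits=6):
    # symmetric per-tensor quantization: scale so the largest |w| maps to qmax
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.03, 0.99], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

The int6 values are stored in int8 containers here; a real artifact would bit-pack them to reach the ~16 MB size above.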
Architecture
XSA
Uses XSA in all 11 layers as part of the base model stack.
parameters: {"layers":11}
MLP3x
Three MLP blocks with a squared LeakyReLU activation (negative slope 0.5), i.e. LeakyReLU(0.5)^2.
parameters: {"mlp_blocks":3}
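The activation reads as the elementwise square of a LeakyReLU with negative slope 0.5. A sketch of one block; the expand/project wiring is an assumption, as the record stores only mlp_blocks=3:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) followed by an elementwise square
    y = np.where(x > 0, x, slope * x)
    return y * y

def mlp_block(x, w_in, w_out):
    # one of the three MLP blocks: expand, activate, project back
    return leaky_relu_sq(x @ w_in) @ w_out
```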
GQA
Grouped-query attention configured with kv_heads equal to query_heads, i.e. no grouping (equivalent to standard multi-head attention).
parameters: {"kv_heads":8,"query_heads":8}
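With 8 KV heads for 8 query heads the query-to-KV mapping is the identity; the sketch below shows the general grouping rule this setting degenerates from (the function name is illustrative):

```python
def kv_head_for(query_head: int, query_heads: int = 8, kv_heads: int = 8) -> int:
    """Map a query head to the KV head it shares. With the PR's setting
    (8 query heads, 8 KV heads) this is the identity, i.e. plain MHA."""
    assert query_heads % kv_heads == 0
    return query_head // (query_heads // kv_heads)
```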
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adamw_for":"embeddings/scalars"}
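Muon's defining step is an approximate orthogonalization of the 2-D momentum matrices via a Newton-Schulz iteration, with AdamW covering embeddings and scalars as recorded above. A sketch using the commonly published quintic coefficients, which are an assumption here:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    # approximately orthogonalize G (drive its singular values toward 1);
    # the quintic coefficients below are the widely used published values
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)              # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                                     # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```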
Weight Averaging
EMA + SWA
parameters: null
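The two averaging schemes can be sketched as follows; the EMA decay is an assumption, since the record stores no parameters:

```python
def ema_update(avg, new, decay=0.999):
    # exponential moving average: recent weights dominate
    return [decay * a + (1 - decay) * n for a, n in zip(avg, new)]

def swa_update(avg, new, n_seen):
    # stochastic weight averaging: equal-weight mean over n_seen checkpoints
    return [a + (n - a) / (n_seen + 1) for a, n in zip(avg, new)]
```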
Compression
lzma
level: null
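A minimal sketch of packing the artifact with Python's stdlib lzma; the preset is an assumption, since the record leaves the level unspecified:

```python
import lzma
import pickle

def compress_artifact(state: dict, preset: int = 9) -> bytes:
    # serialize, then LZMA-compress; preset=9 (maximum) is an assumed level
    return lzma.compress(pickle.dumps(state), preset=preset)

def load_artifact(blob: bytes) -> dict:
    return pickle.loads(lzma.decompress(blob))

state = {"w": [0.0] * 10000}        # toy parameter dict, highly compressible
blob = compress_artifact(state)
```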
Evaluation
sliding window eval
parameters: null
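No parameters are recorded; a generic sketch of what sliding-window evaluation means, with the window size as an assumption and prob_fn a hypothetical stand-in for the model:

```python
import math

def sliding_window_bpb(prob_fn, tokens, window=512):
    # score each token given at most the previous window-1 tokens,
    # keeping evaluation cost bounded on arbitrarily long sequences
    total_bits = 0.0
    for i, tok in enumerate(tokens):
        ctx = tokens[max(0, i - window + 1):i]
        total_bits += -math.log2(prob_fn(ctx, tok))
    return total_bits / len(tokens)
```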
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.00003,"epochs":1,"chunk_tokens":1000000,"freeze_blocks":2,"polyak_decay":0.998}
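Score-first means each chunk is scored before the model adapts on it, so no chunk is evaluated with weights that have already seen it. A sketch wiring the recorded hyperparameters together; loss_fn and grad_fn are hypothetical stand-ins for the model, and parameter names are assumed to start with a block index (splitting the ~1M-token chunks happens upstream):

```python
def score_first_ttt(model_params, chunks, grad_fn, loss_fn,
                    lr=3e-5, freeze_blocks=2, polyak=0.998):
    live = dict(model_params)   # weights updated by TTT
    avg = dict(model_params)    # Polyak-averaged weights used for scoring
    total = 0.0
    for chunk in chunks:
        total += loss_fn(avg, chunk)                  # score FIRST ...
        for name, g in grad_fn(live, chunk).items():  # ... then adapt
            if int(name.split(".")[0]) < freeze_blocks:
                continue                              # early blocks stay frozen
            live[name] = live[name] - lr * g
            avg[name] = polyak * avg[name] + (1 - polyak) * live[name]
    return total / len(chunks), avg
```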
Regularization
pruning
parameters: {"sparsity":0.03,"type":"magnitude"}
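Magnitude pruning at 3% sparsity zeroes the smallest 3% of weights by absolute value; a minimal per-tensor sketch:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    # zero out the smallest-magnitude `sparsity` fraction of entries
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```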
Other
other
Backoff n-gram mixer with entropy-adaptive alpha mixing across orders 2-7.
parameters: {"orders":[2,3,4,5,6,7]}
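A sketch of the mixing rule: each order's n-gram distribution is blended into the running prediction with its gate weight alpha, ascending so that higher orders take precedence when their gates open; the PR's exact backoff scheme may differ:

```python
def backoff_mix(ngram_probs, neural_probs, alphas):
    # start from the neural distribution, then fold in each n-gram order;
    # higher orders are mixed in last, so they dominate when gated open
    p = list(neural_probs)
    for order in sorted(ngram_probs):
        a = alphas[order]
        p = [(1 - a) * pi + a * qi for pi, qi in zip(p, ngram_probs[order])]
    return p
```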
other
Order-adaptive entropy gating using per-order entropy centers for n-gram mixing.
parameters: {"entropy_centers":{"2":4.5,"3":4.2,"4":3.8,"5":3.5,"6":3.2,"7":3}}
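The record stores only the per-order entropy centers. One plausible reading, sketched below, gates each order's alpha by how close its predictive entropy is to that order's center; max_alpha and the Gaussian width are assumptions:

```python
import math

ENTROPY_CENTERS = {2: 4.5, 3: 4.2, 4: 3.8, 5: 3.5, 6: 3.2, 7: 3.0}

def entropy(p):
    # Shannon entropy in bits
    return -sum(x * math.log2(x) for x in p if x > 0)

def order_alpha(p_ngram, order, max_alpha=0.5, width=2.0):
    # gate opens (alpha -> max_alpha) when the order's predictive entropy
    # sits near that order's recorded center
    h = entropy(p_ngram)
    return max_alpha * math.exp(-((h - ENTROPY_CENTERS[order]) / width) ** 2)
```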
other
Drift-free test-time training with a logistic context mixer.
parameters: {"eta":0.1}
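A sketch of a logistic (log-odds) context mixer with the recorded eta=0.1; "drift-free" is read here as the base model staying frozen while only the mixer's weights adapt online. Binary-outcome form for brevity:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    # mix expert probabilities in log-odds space; only these weights adapt
    # at test time, so the base model's parameters never drift
    def __init__(self, n_experts, eta=0.1):
        self.w = [1.0 / n_experts] * n_experts
        self.eta = eta

    def predict(self, probs):
        eps = 1e-6
        z = sum(wi * math.log((pi + eps) / (1 - pi + eps))
                for wi, pi in zip(self.w, probs))
        return sigmoid(z)

    def update(self, probs, outcome):
        # online gradient step on log loss for the mixing weights
        p = self.predict(probs)
        eps = 1e-6
        for i, pi in enumerate(probs):
            self.w[i] += self.eta * (outcome - p) * math.log((pi + eps) / (1 - pi + eps))
        return p
```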

Novel Contributions

  • Order-adaptive entropy gating with per-n-gram-order entropy centers
  • BackoffNgramMixer combining n-gram predictions with neural predictions
  • Drift-free score-first test-time training
  • Entropy-adaptive alpha mixing across n-gram orders