PR #1043

open

PP12: Bayesian posterior packets + selective gating (1.1261 BPB)

val_bpb: 1.1261
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
LeakyReLU
Uses a squared LeakyReLU activation in the MLP blocks of the base stack.
parameters: {"power":2}
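A minimal sketch of a squared LeakyReLU, assuming the activation is the LeakyReLU output raised to the listed power of 2. The negative slope is not stated in this PR, so the `negative_slope=0.5` default is a placeholder, and `leaky_relu_squared` is a hypothetical name.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5, power: int = 2) -> float:
    """Apply LeakyReLU, then raise the result to `power`.

    negative_slope is an assumed value; the PR only lists power=2.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y ** power


if __name__ == "__main__":
    for value in (-2.0, 0.0, 2.0):
        print(value, leaky_relu_squared(value))
```

Note that with an even power the output is non-negative for every input, so the negative branch contributes a smaller positive value rather than a negative one.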
Legal TTT
Score-first test-time training stack used in the base model.
parameters: null
BigramHash
Bigram hash component used in the base stack.
parameters: {"vocab_size":3072}
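One way a bigram hash can map a token pair to a row of a 3072-entry table is sketched below. The hash function, the `multiplier` constant, and the reading of `vocab_size` as the number of hash buckets are all assumptions; the PR does not describe how the hash is computed.

```python
def bigram_bucket(prev_token: int, token: int,
                  num_buckets: int = 3072,
                  multiplier: int = 1_000_003) -> int:
    """Map a (previous token, current token) pair to a bucket index.

    The multiplicative hash used here is illustrative only.
    """
    return (prev_token * multiplier + token) % num_buckets


if __name__ == "__main__":
    # Each bucket index would select one row of a learned embedding table.
    print(bigram_bucket(17, 42))
```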
XSA
Cross/self-attention style component applied to the last layers.
parameters: {"layers":4}
Partial RoPE
Applies rotary position embeddings to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
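The following sketch shows what rotating 16 of 64 head dimensions can look like. It assumes the rotated dimensions are the first 16, that they are paired in a split-half layout, and that the frequency base is 10000; none of these choices is stated in the PR, and `partial_rope` is a hypothetical name.

```python
import math


def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of `vec`; leave the rest untouched."""
    assert rot_dims % 2 == 0 and rot_dims <= len(vec)
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        # Each pair (i, i + half) is rotated by a position-dependent angle.
        theta = position * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


if __name__ == "__main__":
    head = [float(i) for i in range(64)]
    rotated = partial_rope(head, position=5)
    print(rotated[16:] == head[16:])  # dimensions 16..63 carry no position signal
```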
VE128
Value embedding component used in later layers.
parameters: {"layers":[9,10]}
Weight Averaging
EMA
parameters: {"decay":0.997}
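For reference, an exponential moving average with the listed decay of 0.997 is commonly updated as in the sketch below. Representing the weights as a flat list and the name `ema_update` are assumptions made for illustration.

```python
def ema_update(ema, weights, decay=0.997):
    """Return the new EMA: decay * ema + (1 - decay) * weights, elementwise."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


if __name__ == "__main__":
    ema = [0.0, 0.0]
    for _ in range(1000):
        ema = ema_update(ema, [1.0, -1.0])
    print(ema)  # approaches the constant weights as steps accumulate
```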
Tight SWA
parameters: {"interval":50}
Quantization
GPTQ-lite
bits: 6
scope: all
Compression
lzma
level: null
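The sketch below illustrates 6-bit storage followed by lzma compression. It uses plain symmetric round-to-nearest quantization, which is a simpler stand-in and not the GPTQ-lite procedure used in the PR; the one-byte-per-code packing and all function names are also assumptions. Since the compression level is listed as null, `lzma.compress` is called with its default preset.

```python
import lzma


def quantize_symmetric(values, bits=6):
    """Round-to-nearest symmetric quantization to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def pack_and_compress(codes):
    """Store one code per byte (not bit-packed) and compress with lzma."""
    raw = bytes(c & 0xFF for c in codes)
    return lzma.compress(raw)


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0] * 256
    codes, scale = quantize_symmetric(weights)
    blob = pack_and_compress(codes)
    print(len(weights), "values ->", len(blob), "compressed bytes")
```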
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Test-Time Training
full TTT
parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"grad_clip":1}
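A single update step consistent with these parameters might look like the sketch below. It assumes `grad_clip` means clipping by global gradient norm and that momentum follows the common `v = momentum * v + g` convention; the PR lists only the values, so both readings and the function names are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale the gradient down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)


def sgd_momentum_step(weights, grads, velocity,
                      lr=0.002, momentum=0.9, grad_clip=1.0):
    """One SGD-with-momentum step on clipped gradients."""
    grads = clip_by_global_norm(grads, grad_clip)
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = [1.0, -1.0], [0.0, 0.0]
    for epoch in range(3):  # epochs=3, one toy step per epoch
        w, v = sgd_momentum_step(w, [3.0, 4.0], v)
    print(w, v)
```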
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
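The formula can be evaluated per layer as in this short sketch. It assumes `layer` is a 0-based index, so the first layer gets a scale of 1; how the factor is applied to the LayerNorm output is not stated in the PR and is not modeled here.

```python
import math


def ln_scale(layer: int) -> float:
    """Depth-dependent scale 1/sqrt(layer + 1), with layer assumed 0-based."""
    return 1.0 / math.sqrt(layer + 1)


if __name__ == "__main__":
    for layer in range(4):
        print(layer, round(ln_scale(layer), 4))
```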
Other
other
Bayesian posterior packets distilled from training data and updated online with conjugate counts.
parameters: {"packet_store":true,"online_update":true}
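One plausible reading of "conjugate counts" is a Dirichlet-multinomial model per context, sketched below. Here a packet holds next-token counts distilled from training data as a prior and adds eval-time observations to the same counts. The class name, the smoothing parameter `alpha`, and the per-context layout are assumptions, not details taken from the PR.

```python
from collections import defaultdict


class PosteriorPacket:
    """Dirichlet-multinomial next-token posterior for one context."""

    def __init__(self, vocab_size, prior_counts=None, alpha=1.0):
        self.vocab_size = vocab_size
        self.alpha = alpha  # symmetric Dirichlet pseudo-count (assumed)
        self.counts = defaultdict(float)
        for token, count in (prior_counts or {}).items():
            self.counts[token] += count
        self.total = sum(self.counts.values())

    def update(self, token):
        """Conjugate update: observing a token adds one to its count."""
        self.counts[token] += 1.0
        self.total += 1.0

    def predictive(self, token):
        """Posterior predictive probability of `token`."""
        numerator = self.counts.get(token, 0.0) + self.alpha
        return numerator / (self.total + self.alpha * self.vocab_size)


if __name__ == "__main__":
    packet = PosteriorPacket(vocab_size=4, prior_counts={0: 3})
    print(packet.predictive(0))  # prior from training data
    packet.update(1)             # eval-time observation
    print(packet.predictive(1))
```

Because the Dirichlet prior is conjugate to the categorical likelihood, the online update reduces to incrementing a count, which keeps the eval-time cost low.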
other
Selective gating mixes packet posteriors with neural predictions only when the packet is more confident than the neural prediction.
parameters: {"confidence_margin":0.05,"has_data_threshold":20}
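The gate can be sketched as follows, using the listed `confidence_margin` and `has_data_threshold`. The sketch assumes confidence is the maximum probability of each distribution and that the threshold applies to the packet's observation count; the 50/50 `mix_weight` is a placeholder, since the PR does not give the mixing rule.

```python
def selective_gate(neural_probs, packet_probs, packet_count,
                   confidence_margin=0.05, has_data_threshold=20,
                   mix_weight=0.5):
    """Mix in the packet only when it has data and is clearly more confident.

    Otherwise the neural distribution is returned unchanged.
    """
    has_data = packet_count >= has_data_threshold
    more_confident = max(packet_probs) > max(neural_probs) + confidence_margin
    if not (has_data and more_confident):
        return list(neural_probs)
    return [(1.0 - mix_weight) * n + mix_weight * p
            for n, p in zip(neural_probs, packet_probs)]


if __name__ == "__main__":
    neural = [0.5, 0.5]
    print(selective_gate(neural, [0.9, 0.1], packet_count=25))  # gate opens
    print(selective_gate(neural, [0.9, 0.1], packet_count=5))   # too little data
```

Falling back to the neural distribution when the gate is closed is what distinguishes this from naive mixing, where a weak packet would dilute a good neural prediction.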

Novel Contributions

  • Bayesian posterior packets distilled from training data
  • Conjugate online updating of packet posteriors with eval-time counts
  • Selective gating to avoid degradation from naive probability mixing
  • Packet-based improvement over pure neural TTT
  • Periodic TTT reset, proposed as a way to address drift during long evaluation
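The periodic reset idea in the last item can be sketched as an evaluation loop that restores the base state at a fixed interval. Everything here is hypothetical: the function name, the `reset_interval` value, and the `score_fn`/`adapt_fn` callables stand in for whatever the PR's TTT stack actually does. Each chunk is scored before the state adapts to it, matching the score-first order described for the base model.

```python
import copy


def evaluate_with_periodic_reset(chunks, base_state, score_fn, adapt_fn,
                                 reset_interval=100):
    """Score each chunk, then adapt; restore the base state every N chunks."""
    state = copy.deepcopy(base_state)
    total = 0.0
    for index, chunk in enumerate(chunks):
        if index > 0 and index % reset_interval == 0:
            # Discard accumulated adaptation to limit drift.
            state = copy.deepcopy(base_state)
        total += score_fn(state, chunk)   # score first
        state = adapt_fn(state, chunk)    # then adapt on the scored chunk
    return total


if __name__ == "__main__":
    # Toy state: a counter of how many chunks the state has adapted on.
    score = lambda state, chunk: float(state["n"])
    adapt = lambda state, chunk: {"n": state["n"] + 1}
    print(evaluate_with_periodic_reset(range(5), {"n": 0}, score, adapt,
                                       reset_interval=2))
```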