val_bpb: 1.1261
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.99 MB
Training Techniques
Architecture
LeakyReLU
Uses a squared LeakyReLU activation in the MLP blocks of the base stack.
parameters: {"power":2}
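A minimal sketch of the squared activation above. The card only gives `power: 2`; the negative slope and the sign-preserving treatment of negative inputs are assumptions.

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.01, power=2):
    """Apply LeakyReLU, then raise to `power`, keeping the sign so
    negative inputs stay negative (sign handling is an assumption)."""
    y = np.where(x >= 0, x, negative_slope * x)
    return np.sign(y) * np.abs(y) ** power
```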
Legal TTT
Score-first test-time training stack used in the base model.
parameters: null
BigramHash
Bigram hash component used in the base stack.
parameters: {"vocab_size":3072}
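One way the bigram-hash component could index an auxiliary embedding table of the stated size; the hash function itself is not specified in the card, so the multiplier here is an arbitrary placeholder.

```python
def bigram_hash(prev_token, token, vocab_size=3072):
    """Hash a (previous, current) token-id pair into a fixed-size table.
    The odd multiplier is a placeholder; the real hash is unspecified."""
    return (prev_token * 1000003 + token) % vocab_size

# Each position past the first looks up one extra embedding row:
ids = [5, 17, 42]
idx = [bigram_hash(p, t) for p, t in zip(ids[:-1], ids[1:])]
```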
XSA
Cross/self-attention-style component applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Applies rotary position embeddings to 16 of the 64 head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
VE128
Value-embedding component applied in layers 9 and 10.
parameters: {"layers":[9,10]}
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: {"interval":50}
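The two averaging rules above, sketched over a dict of weights (plain floats for clarity): an EMA with the card's decay, and a running SWA mean over snapshots taken every `interval` steps.

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params."""
    return {k: decay * avg[k] + (1 - decay) * params[k] for k in avg}

def swa_update(mean, params, n):
    """Incorporate the (n+1)-th snapshot into a running average:
    mean <- mean + (params - mean) / (n + 1)."""
    return {k: mean[k] + (params[k] - mean[k]) / (n + 1) for k in mean}
```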
Quantization
GPTQ-lite
bits: 6
scope: all
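For reference, plain symmetric round-to-nearest 6-bit quantization of a tensor; the card's GPTQ-lite presumably adds error-compensating weight updates on top of something like this baseline.

```python
import numpy as np

def quantize_6bit(w):
    """Symmetric per-tensor 6-bit quantization (baseline sketch)."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for signed 6-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```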
Compression
lzma
level: null
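The `level: null` entry suggests the default LZMA preset; the stdlib call is simply:

```python
import lzma

def pack(blob: bytes, preset=None) -> bytes:
    """Compress a serialized weight blob with LZMA; a preset of None
    uses the library default."""
    if preset is None:
        return lzma.compress(blob)
    return lzma.compress(blob, preset=preset)
```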
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Test-Time Training
Full TTT
parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"grad_clip":1}
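One test-time-training update consistent with the listed hyperparameters; the card names only the numbers, so SGD with momentum and global-norm clipping is an assumption about the update rule.

```python
import numpy as np

def ttt_step(params, grads, velocity, lr=2e-3, momentum=0.9, grad_clip=1.0):
    """Clip gradients to global norm `grad_clip`, then take one
    momentum-SGD step in place."""
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads.values()))
    scale = min(1.0, grad_clip / (norm + 1e-12))
    for k in params:
        velocity[k] = momentum * velocity[k] + grads[k] * scale
        params[k] -= lr * velocity[k]
    return params, velocity
```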
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
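The stated formula, as a per-layer scale that damps the output of deeper layers:

```python
import math

def ln_scale(layer):
    """LayerNorm output scale 1/sqrt(layer+1) for 0-indexed layers."""
    return 1.0 / math.sqrt(layer + 1)
```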
Other
Posterior packets
Bayesian posterior packets distilled from training data and updated online with conjugate counts.
parameters: {"packet_store":true,"online_update":true}
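A minimal sketch of one such packet as a Dirichlet-multinomial posterior: pseudo-counts distilled from training data, with the conjugate online update just adding observed counts. The storage layout and prior are assumptions.

```python
import numpy as np

class PosteriorPacket:
    """Dirichlet-multinomial 'packet' over next-token outcomes."""

    def __init__(self, prior_counts):
        # Pseudo-counts distilled from training data (the prior).
        self.counts = np.asarray(prior_counts, dtype=np.float64)

    def update(self, token):
        # Conjugate online update: just add the observed count.
        self.counts[token] += 1.0

    def predictive(self):
        # Posterior predictive distribution over outcomes.
        return self.counts / self.counts.sum()
```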
Selective gating
Selective gating mixes packet posteriors with neural predictions only when packet confidence is higher.
parameters: {"confidence_margin":0.05,"has_data_threshold":20}
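The gate described above, sketched with the card's thresholds; choosing the packet distribution outright (rather than interpolating) when it wins is an assumption about the mixing rule.

```python
import numpy as np

def gated_mix(neural_probs, packet_probs, n_obs,
              confidence_margin=0.05, has_data_threshold=20):
    """Use the packet posterior only when it has enough observations and
    its top probability beats the neural model's by the margin;
    otherwise keep the neural prediction."""
    if (n_obs >= has_data_threshold
            and packet_probs.max() >= neural_probs.max() + confidence_margin):
        return packet_probs
    return neural_probs
```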
Novel Contributions
- Bayesian posterior packets distilled from training data
- Conjugate online updating of packet posteriors with eval-time counts
- Selective gating to avoid degradation from naive probability mixing
- Packet-based improvement over pure neural TTT
- Periodic TTT reset idea to address drift during long evaluation