PR #1043

open

PP12: Bayesian posterior packets + selective gating (1.1261 BPB)

by okezue

val_bpb

1.1261

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

~15.99 MB

Training Techniques

Architecture

LeakyReLU

Uses a squared LeakyReLU (LeakyReLU followed by raising to the power 2) as the MLP activation in the base stack.

parameters: {"power":2}
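A minimal sketch of the activation described above, written for a single scalar. The `negative_slope` value is an assumption (0.01 is PyTorch's default for LeakyReLU); the PR summary only records the power. Note that plain squaring maps negative inputs to small positive outputs; whether the PR preserves the sign is not stated here.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.01, power: int = 2) -> float:
    """LeakyReLU followed by raising the result to `power`.

    negative_slope is an assumed value; only power=2 comes from the summary.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y ** power
```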

Legal TTT

Score-first test-time training stack used in the base model.

parameters: null
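The "score-first" ordering can be sketched as a loop in which every chunk is scored with the current weights before the model is allowed to adapt on it. The function and callback names (`score_first_ttt`, `score_fn`, `adapt_fn`) are hypothetical and only illustrate the ordering; they are not taken from the PR.

```python
from typing import Callable, Iterable, TypeVar

Chunk = TypeVar("Chunk")

def score_first_ttt(
    chunks: Iterable[Chunk],
    score_fn: Callable[[Chunk], float],
    adapt_fn: Callable[[Chunk], None],
) -> float:
    """Accumulate the loss of each chunk BEFORE training on that chunk."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(chunk)  # score with weights that have not seen this chunk
        adapt_fn(chunk)                # only afterwards update on the already-scored tokens
    return total_loss
```

Because adaptation always follows scoring, no token contributes to its own score.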

BigramHash

Bigram hash component used in the base stack.

parameters: {"vocab_size":3072}
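As a rough illustration of how a bigram hash can map a token pair into a fixed-size table of 3072 entries, the sketch below hashes `(prev_token, token)` to a bucket index. The hash function and its multiplier are assumptions chosen for simplicity; the PR summary does not specify them.

```python
def bigram_bucket(prev_token: int, token: int, vocab_size: int = 3072) -> int:
    """Map an ordered token pair to a bucket in [0, vocab_size).

    The multiplicative hash below is a hypothetical choice; any
    deterministic pair hash would serve the same illustrative purpose.
    """
    return (prev_token * 1000003 + token) % vocab_size
```

The resulting index would typically select a row of a learned embedding table, so different bigrams that collide share one embedding.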

XSA

Cross/self-attention style component applied to the last 4 layers.

parameters: {"layers":4}

Partial RoPE

Applies rotary position embeddings to 16 of the 64 dimensions and leaves the remaining dimensions unrotated.

parameters: {"dimensions":16,"total_dimensions":64}
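A minimal sketch of partial RoPE on one vector: only the first `rope_dims` entries are rotated and the rest pass through. The pairing convention (entry `i` with entry `i + rope_dims/2`) and the frequency base of 10000 are assumptions; the summary only gives the 16-of-64 split.

```python
import math
from typing import List

def partial_rope(vec: List[float], position: int, rope_dims: int = 16,
                 base: float = 10000.0) -> List[float]:
    """Rotate the first `rope_dims` entries of `vec`; leave the others untouched."""
    if rope_dims % 2 != 0 or rope_dims > len(vec):
        raise ValueError("rope_dims must be even and no larger than len(vec)")
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        # Assumed frequency schedule: base ** (-2i / rope_dims)
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Each pair undergoes a plane rotation, so the norm of the rotated slice is preserved and position 0 leaves the vector unchanged.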

VE128

Value embedding component used in layers 9 and 10.

parameters: {"layers":[9,10]}

Weight Averaging

EMA

parameters: {"decay":0.997}
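For reference, an exponential moving average of weights with decay 0.997 can be sketched as below, using plain lists in place of tensors. Which parameters are averaged and how often the update runs are not stated in the summary, so the per-step usage is an assumption.

```python
from typing import List

def ema_update(ema_weights: List[float], weights: List[float],
               decay: float = 0.997) -> List[float]:
    """Return the new EMA: decay * old_ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With decay 0.997, each update moves the average 0.3% of the way toward the current weights.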

Tight SWA

parameters: {"interval":50}

Quantization

GPTQ-lite

bits: 6

scope: all

Compression

lzma

level: null

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: null

Test-Time Training

full TTT

parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"grad_clip":1}
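The listed hyperparameters suggest SGD with momentum and gradient clipping. The sketch below shows one such update on plain lists. Interpreting `grad_clip` as a global L2-norm clip and using the PyTorch-style momentum convention are both assumptions; the helper names are hypothetical.

```python
import math
from typing import List, Tuple

def clip_by_global_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Scale gradients down so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grads]

def sgd_momentum_step(params: List[float], grads: List[float], velocity: List[float],
                      lr: float = 0.002, momentum: float = 0.9,
                      grad_clip: float = 1.0) -> Tuple[List[float], List[float]]:
    """One update: clip, accumulate velocity, then step against it."""
    clipped = clip_by_global_norm(grads, grad_clip)
    new_velocity = [momentum * v + g for v, g in zip(velocity, clipped)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

Under this reading, each evaluation chunk would receive 3 epochs of such steps after it has been scored.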

Regularization

LN scale

parameters: {"formula":"1/sqrt(layer+1)"}
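The scale formula is simple enough to state directly. The sketch assumes `layer` is a zero-based layer index, so the first layer gets scale 1; what exactly is multiplied by this factor (for example the normalized activations) is not specified in the summary.

```python
import math

def ln_scale(layer: int) -> float:
    """Per-layer scale 1/sqrt(layer + 1); layer is assumed to be zero-based."""
    return 1.0 / math.sqrt(layer + 1)
```

Deeper layers are damped progressively: layer 3 is scaled by 0.5 under this indexing.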

Other

Bayesian posterior packets

Bayesian posterior packets are distilled from the training data and updated online during evaluation by adding observed counts to a conjugate prior.

parameters: {"packet_store":true,"online_update":true}
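One common way to realize conjugate count updates is a Dirichlet-categorical model, sketched below. This is an assumption about the general technique, not the PR's actual packet format: the class name `PosteriorPacket`, the symmetric prior `alpha`, and keying a packet by nothing more than token counts are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional

class PosteriorPacket:
    """Dirichlet-categorical posterior over next tokens (illustrative only)."""

    def __init__(self, vocab_size: int, alpha: float = 0.1,
                 train_counts: Optional[Dict[int, int]] = None) -> None:
        self.vocab_size = vocab_size
        self.alpha = alpha                        # assumed symmetric prior pseudo-count
        self.counts = Counter(train_counts or {})  # counts distilled from training data

    def n_observations(self) -> int:
        return sum(self.counts.values())

    def predictive(self, token: int) -> float:
        """Posterior predictive: (count + alpha) / (total + alpha * V)."""
        denom = self.n_observations() + self.alpha * self.vocab_size
        return (self.counts[token] + self.alpha) / denom

    def observe(self, token: int) -> None:
        """Conjugate online update: add one count for an eval-time token."""
        self.counts[token] += 1
```

To stay consistent with score-first evaluation, `predictive` would be queried for a token before `observe` is called with it.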

Selective gating

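Selective gating mixes packet posteriors with neural predictions only when the packet has seen enough data and is more confident than the neural model by a margin; otherwise the neural prediction is used unchanged.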

parameters: {"confidence_margin":0.05,"has_data_threshold":20}
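The gating rule can be sketched as follows, under several assumptions: "confidence" is taken to be the maximum probability of each distribution, `has_data_threshold` is read as a minimum observation count for the packet, and the 50/50 `mix_weight` is a placeholder. Only the two threshold values come from the summary.

```python
from typing import List

def gated_mix(neural_probs: List[float], packet_probs: List[float], packet_count: int,
              confidence_margin: float = 0.05, has_data_threshold: int = 20,
              mix_weight: float = 0.5) -> List[float]:
    """Mix in the packet only when it has enough data and is clearly more confident."""
    if packet_count < has_data_threshold:
        return list(neural_probs)  # too little evidence: keep the neural prediction
    if max(packet_probs) <= max(neural_probs) + confidence_margin:
        return list(neural_probs)  # packet not confident enough: keep the neural prediction
    # Linear interpolation; mix_weight is a hypothetical value
    return [(1.0 - mix_weight) * n + mix_weight * p
            for n, p in zip(neural_probs, packet_probs)]
```

Falling back to the unmodified neural distribution in the default case is what protects against the degradation that unconditional mixing can cause.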

Novel Contributions

Bayesian posterior packets distilled from training data
Conjugate online updating of packet posteriors with eval-time counts
Selective gating to avoid degradation from naive probability mixing
Packet-based improvement over pure neural TTT
Periodic TTT reset idea to address drift during long evaluation