PR #1159

open

Add non-record 16MB submission: Dirichlet PPM + Legal TTT on 8xH100

by JDAppleseed
val_bpb
0.3693
Architecture
Transformer
Optimizer
Muon
Artifact Size
10,176,408 bytes

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
Gated Attention
Gated attention disabled in this run.
parameters: {"enabled":0}
Value Residual
Value residual disabled in this run.
parameters: {"enabled":0}
VE128
Value embeddings (VE) enabled with dimension 128, applied at layers 9 and 10.
parameters: {"dim":128,"layers":[9,10]}
BigramHash
Bigram hash embedding component used.
parameters: {"dim":128,"vocab_size":1536}
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"enabled":1,"every":50}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
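The sliding-window evaluation above (2048-token window, stride 64) amounts to simple index arithmetic: each window scores only the tokens not covered by the previous window, so every token after the first window is predicted with at least 2048 - 64 tokens of left context. A minimal sketch (the helper name and the small test sizes are illustrative assumptions, not the submission's code):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    """Plan exact sliding-window evaluation over a token stream.

    Returns (window_begin, window_end, score_begin) triples: the model is
    run on tokens [window_begin, window_end) and only the tokens in
    [score_begin, window_end) contribute to the loss, so each token is
    scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, end, prev_end))  # score [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With these parameters each window after the first contributes 64 freshly scored tokens, trading roughly 32x more forward passes for near-full context at every position.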
Test-Time Training
score-first TTT
parameters: {"enabled":1,"epochs":3,"batch_seqs":32,"chunk_tokens":32768,"learning_rate":0.002,"momentum":0.9}
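One plausible reading of "legal" (causal) chunked test-time training: the eval stream is split into chunks of `chunk_tokens`, and before chunk i is scored the model may take training passes only over chunks that were already scored. The scheduling sketch below encodes that constraint; the helper name and the exact pass structure (full replay of prior chunks per epoch) are assumptions, not the submission's implementation:

```python
def legal_ttt_schedule(n_chunks, epochs=3):
    """Plan causal test-time training over an eval stream split into chunks.

    Returns (train_chunks, score_chunk) pairs: before scoring chunk i, the
    model may run `epochs` training passes over chunks 0..i-1 only, so no
    gradient ever depends on tokens that have not yet been scored.  Chunk 0
    is scored with the unadapted model.
    """
    schedule = []
    for i in range(n_chunks):
        train = [c for _ in range(epochs) for c in range(i)]
        schedule.append((train, i))
    return schedule
```

Under this reading, "legal" means the evaluation remains exact: the probability assigned to each token is produced by a model that has only ever seen strictly earlier parts of the eval stream.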
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
weight decay
parameters: {"adamw":0.04,"muon":0.04}
LN scale
parameters: {"enabled":1}
Other
other
Dirichlet posterior predictive PPM cache mixing over orders 2..7 using current model probability as the base prior.
parameters: {"cache_mode":"ppm","max_order":7,"mixing":"dirichlet","alpha":0.3,"count_smoothing":4}
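The Dirichlet posterior predictive mixing and the score-first causal update can be sketched together: each PPM order keeps next-token counts, and order k's predictive distribution is the Dirichlet posterior predictive whose base measure is the order-(k-1) mixture, bottoming out at the model's own probability. A minimal sketch, assuming this chaining; class and method names are illustrative, and the submission's `count_smoothing` parameter is omitted because its exact meaning is not stated:

```python
from collections import defaultdict

class DirichletPPMCache:
    """Sketch of Dirichlet posterior-predictive PPM cache mixing.

    At order k, p_k(x|ctx) = (n_k(ctx,x) + alpha * p_{k-1}(x|ctx)) / (N_k(ctx) + alpha),
    with the current model probability as the order-1 base prior.  Because the
    base measure is normalized, each mixed distribution is also normalized.
    """

    def __init__(self, min_order=2, max_order=7, alpha=0.3):
        self.min_order = min_order
        self.max_order = max_order
        self.alpha = alpha
        # counts[k][context_tuple][token] -> occurrences committed so far
        self.counts = {k: defaultdict(lambda: defaultdict(int))
                       for k in range(min_order, max_order + 1)}
        self.totals = {k: defaultdict(int)
                       for k in range(min_order, max_order + 1)}

    def prob(self, history, token, p_model):
        """Score-first: uses only counts committed at earlier positions."""
        p = p_model  # base prior: the current model's probability
        for k in range(self.min_order, self.max_order + 1):
            if len(history) < k:
                break
            ctx = tuple(history[-k:])
            n = self.counts[k][ctx][token]
            total = self.totals[k][ctx]
            # Dirichlet posterior predictive, base measure = lower-order mix
            p = (n + self.alpha * p) / (total + self.alpha)
        return p

    def update(self, history, token):
        """Commit the observed token only after it has been scored (causal)."""
        for k in range(self.min_order, self.max_order + 1):
            if len(history) < k:
                break
            ctx = tuple(history[-k:])
            self.counts[k][ctx][token] += 1
            self.totals[k][ctx] += 1
```

The `prob`/`update` split is what makes the cache legal for exact evaluation: a token's score never depends on its own count, only on previously committed ones.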

Novel Contributions

  • Dirichlet PPM cache mixing for posterior predictive backoff over PPM orders 2..7
  • Score-first causal cache updates using only previously committed counts plus current model probability
  • Validation of distributed exact-eval path for cache-enabled post-train evaluation on 8xH100
  • Legal test-time training (TTT) combined with exact sliding-window evaluation in a non-record run