PR #1159

open

Add non-record 16MB submission: Dirichlet PPM + Legal TTT on 8xH100

by JDAppleseed
val_bpb
0.3693
Architecture
Transformer
Optimizer
Muon
Artifact Size
10,176,408 bytes

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
Gated Attention
Gated attention disabled in this run.
parameters: {"enabled":0}
Value Residual
Value residual disabled in this run.
parameters: {"enabled":0}
VE128
Value embeddings (VE) enabled with dimension 128, applied at layers 9 and 10.
parameters: {"dim":128,"layers":[9,10]}
BigramHash
Bigram hash embedding component used.
parameters: {"dim":128,"vocab_size":1536}
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"enabled":1,"every":50}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500}
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
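The sliding-window evaluation above (2048-token window, stride 64) amounts to simple index arithmetic: each window scores only the tokens not covered by the previous window, so every token after the first window is predicted with at least 2048 - 64 tokens of left context. A minimal sketch (the helper name and the small test sizes are illustrative assumptions, not the submission's code):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    """Plan exact sliding-window evaluation over a token stream.

    Returns (window_begin, window_end, score_begin) triples: the model is
    run on tokens [window_begin, window_end) and only the tokens in
    [score_begin, window_end) contribute to the loss, so each token is
    scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, end, prev_end))  # score [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With these parameters each window after the first contributes 64 freshly scored tokens, trading roughly 32x more forward passes for near-full context at every position.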
Test-Time Training
score-first TTT
parameters: {"enabled":1,"epochs":3,"batch_seqs":32,"chunk_tokens":32768,"learning_rate":0.002,"momentum":0.9}
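One plausible reading of "legal" (causal) chunked test-time training: the eval stream is split into chunks of `chunk_tokens`, and before chunk i is scored the model may take training passes only over chunks that were already scored. The scheduling sketch below encodes that constraint; the helper name and the exact pass structure (full replay of prior chunks per epoch) are assumptions, not the submission's implementation:

```python
def legal_ttt_schedule(n_chunks, epochs=3):
    """Plan causal test-time training over an eval stream split into chunks.

    Returns (train_chunks, score_chunk) pairs: before scoring chunk i, the
    model may run `epochs` training passes over chunks 0..i-1 only, so no
    gradient ever depends on tokens that have not yet been scored.  Chunk 0
    is scored with the unadapted model.
    """
    schedule = []
    for i in range(n_chunks):
        train = [c for _ in range(epochs) for c in range(i)]
        schedule.append((train, i))
    return schedule
```

Under this reading, "legal" means the evaluation remains exact: the probability assigned to each token is produced by a model that has only ever seen strictly earlier parts of the eval stream.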
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Regularization
weight decay
parameters: {"adamw":0.04,"muon":0.04}
LN scale
parameters: {"enabled":1}
Other
other
Dirichlet posterior predictive PPM cache mixing over orders 2..7 using current model probability as the base prior.
parameters: {"cache_mode":"ppm","max_order":7,"mixing":"dirichlet","alpha":0.3,"count_smoothing":4}
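The Dirichlet posterior predictive mixing and the score-first causal update can be sketched together: each PPM order keeps next-token counts, and order k's predictive distribution is the Dirichlet posterior predictive whose base measure is the order-(k-1) mixture, bottoming out at the model's own probability. A minimal sketch, assuming this chaining; class and method names are illustrative, and the submission's `count_smoothing` parameter is omitted because its exact meaning is not stated:

```python
from collections import defaultdict

class DirichletPPMCache:
    """Sketch of Dirichlet posterior-predictive PPM cache mixing.

    At order k, p_k(x|ctx) = (n_k(ctx,x) + alpha * p_{k-1}(x|ctx)) / (N_k(ctx) + alpha),
    with the current model probability as the order-1 base prior.  Because the
    base measure is normalized, each mixed distribution is also normalized.
    """

    def __init__(self, min_order=2, max_order=7, alpha=0.3):
        self.min_order = min_order
        self.max_order = max_order
        self.alpha = alpha
        # counts[k][context_tuple][token] -> occurrences committed so far
        self.counts = {k: defaultdict(lambda: defaultdict(int))
                       for k in range(min_order, max_order + 1)}
        self.totals = {k: defaultdict(int)
                       for k in range(min_order, max_order + 1)}

    def prob(self, history, token, p_model):
        """Score-first: uses only counts committed at earlier positions."""
        p = p_model  # base prior: the current model's probability
        for k in range(self.min_order, self.max_order + 1):
            if len(history) < k:
                break
            ctx = tuple(history[-k:])
            n = self.counts[k][ctx][token]
            total = self.totals[k][ctx]
            # Dirichlet posterior predictive, base measure = lower-order mix
            p = (n + self.alpha * p) / (total + self.alpha)
        return p

    def update(self, history, token):
        """Commit the observed token only after it has been scored (causal)."""
        for k in range(self.min_order, self.max_order + 1):
            if len(history) < k:
                break
            ctx = tuple(history[-k:])
            self.counts[k][ctx][token] += 1
            self.totals[k][ctx] += 1
```

The `prob`/`update` split is what makes the cache legal for exact evaluation: a token's score never depends on its own count, only on previously committed ones.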

Novel Contributions

  • Dirichlet PPM cache mixing for posterior predictive backoff over PPM orders 2..7
  • Score-first causal cache updates using only previously committed counts plus current model probability
  • Validation of distributed exact-eval path for cache-enabled post-train evaluation on 8xH100
  • Legal test-time training (TTT) combined with exact sliding-window evaluation in a non-record run