PR #1242

open

Record: Scylla + n-gram + legal TTT — val_bpb 1.0903 (3-seed mean)

by Campbellb
val_bpb
1.0903
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.4 MB

Training Techniques

Architecture
BigramHash
Uses BigramHash embeddings with a 10240-token vocabulary.
parameters: {"vocab_size":10240,"dimensions":128}
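A minimal sketch of what a hashed-bigram embedding looks like, using the `vocab_size: 10240` and `dimensions: 128` parameters above. The hash mixing constants and the BOS-as-token-0 convention are illustrative assumptions, not the PR's actual implementation:

```python
import numpy as np

VOCAB_SIZE = 10240   # bucket count from the parameters above
DIM = 128            # embedding dimensions from the parameters above

def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = VOCAB_SIZE) -> int:
    """Hash a (previous, current) token pair into a fixed bucket (mixing is illustrative)."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 2654435761) & 0xFFFFFFFF
    return h % n_buckets

rng = np.random.default_rng(0)
table = rng.normal(0, 0.02, size=(VOCAB_SIZE, DIM))   # learnable in the real model

def bigram_embed(tokens):
    """Look up one hashed-bigram vector per position (position 0 pairs with a BOS id of 0)."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return table[idx]

vecs = bigram_embed([5, 17, 17, 901])   # shape (4, 128)
```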
SmearGate
Smear gate is enabled in the model.
parameters: null
weight tying
Tied embeddings are used.
parameters: null
Quantization
int6
bits: 6
scope: all
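One plausible reading of the int6 setup, combined with the `clip_range=20` mentioned under Novel Contributions (the exact scaling scheme is an assumption): clip weights, compute a symmetric per-tensor scale, and round into the 6-bit signed range [-31, 31]:

```python
import numpy as np

def quantize_int6(w, clip_range=20.0):
    """Clip weights, then symmetric-quantize to 6-bit integers in [-31, 31]."""
    w = np.clip(w, -clip_range, clip_range)
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.round(w / scale).astype(np.int8)   # 6-bit value range, stored in int8
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)   # reconstruction error is at most scale / 2 per weight
```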
Compression
lzma
level: null
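The quantized payload is then lzma-compressed, which is one stdlib call in Python. Since the card leaves `level` unset, the preset below is an assumption:

```python
import lzma
import numpy as np

q = np.zeros(100_000, dtype=np.int8)      # stand-in for the quantized weight payload
raw = q.tobytes()
blob = lzma.compress(raw, preset=9)       # preset is an assumption; the PR leaves level unset
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```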
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
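Muon's defining step is orthogonalizing the momentum buffer with a Newton-Schulz iteration before applying it. A rough single-matrix sketch using the hyperparameters above (quintic coefficients from the public Muon reference; the "parallel"/sharded aspect and the separate scalar/tied-embedding learning rates are omitted):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize by Frobenius norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, state, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One Muon-style update: momentum buffer, orthogonalized, with decoupled weight decay."""
    state["buf"] = momentum * state["buf"] + grad
    update = newton_schulz_orthogonalize(state["buf"])
    return w * (1 - lr * weight_decay) - lr * update

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4))
O = newton_schulz_orthogonalize(G)         # singular values pushed toward 1
state = {"buf": np.zeros_like(G)}
w_new = muon_step(G, G, state)
```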
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":50}
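Both averaging schemes are simple to sketch; `decay=0.997` and `interval=50` come from the parameters above, and the rest is a minimal illustration:

```python
import numpy as np

def ema_update(avg, w, decay=0.997):
    """Exponential moving average of weights: avg <- decay * avg + (1 - decay) * w."""
    return decay * avg + (1.0 - decay) * w

class SWA:
    """Equal-weight running average of checkpoints sampled every `interval` steps."""
    def __init__(self, interval=50):
        self.interval, self.n, self.avg = interval, 0, None

    def maybe_update(self, step, w):
        if step % self.interval != 0:
            return
        self.n += 1
        # incremental mean: avg <- avg + (w - avg) / n
        self.avg = w.copy() if self.avg is None else self.avg + (w - self.avg) / self.n

w = np.ones(4)
ema = np.zeros(4)
for _ in range(3):
    ema = ema_update(ema, w)   # pulls slowly toward w with decay 0.997

swa = SWA(interval=50)
for step in range(101):
    swa.maybe_update(step, np.full(4, float(step)))   # averages checkpoints at steps 0, 50, 100
```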
Initialization
OrthoInit
Orthogonal initialization is enabled.
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":32768}
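"Score-first" is what makes this TTT legal: each chunk is scored with the current weights before the model adapts on it, so the reported BPB never comes from weights that already saw the scored text. A schematic loop with the `learning_rate=0.005` and `epochs=3` parameters above, using a hypothetical `ToyModel` in place of the real network (the 32768-token chunking is assumed to happen upstream):

```python
import math

class ToyModel:
    """Hypothetical stand-in for the real network: its per-byte loss shrinks as it adapts."""
    def __init__(self):
        self.nats_per_byte = 1.0

    def score(self, chunk):
        return self.nats_per_byte * len(chunk), len(chunk)   # (total nats, byte count)

    def train_step(self, chunk, lr):
        self.nats_per_byte *= (1.0 - lr)                     # stand-in for a gradient step

def score_first_ttt(chunks, model, lr=0.005, epochs=3):
    """Score each chunk with the CURRENT weights first, then adapt on that chunk."""
    total_bits, total_bytes = 0.0, 0
    for chunk in chunks:
        nats, n_bytes = model.score(chunk)    # evaluate first ...
        total_bits += nats / math.log(2)
        total_bytes += n_bytes
        for _ in range(epochs):               # ... then adapt on the same chunk
            model.train_step(chunk, lr=lr)
    return total_bits / total_bytes           # bits per byte

bpb = score_first_ttt([b"a" * 100, b"b" * 100], ToyModel())
```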
Evaluation
two-pass eval
parameters: {"ngram_max_order":16}
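A minimal stand-in for the two-pass setup: pass one counts n-grams up to `ngram_max_order`, pass two queries them by backing off from the longest matching context. The plain back-off and add-one smoothing here are illustrative; how the PR interpolates these probabilities with the model's predictions is not shown:

```python
from collections import defaultdict

def ngram_counts(tokens, max_order=16):
    """Pass 1: collect n-gram counts for every order up to max_order."""
    counts = defaultdict(int)
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def ngram_prob(counts, context, tok, vocab_size, max_order=16):
    """Pass 2: back off from the longest matching context; add-one smoothing at order 1."""
    for n in range(max_order, 1, -1):
        ctx = tuple(context[-(n - 1):])
        if len(ctx) == n - 1 and counts.get(ctx, 0) > 0:
            joint = counts.get(ctx + (tok,), 0)
            if joint > 0:
                return joint / counts[ctx]
    unigram_total = sum(v for k, v in counts.items() if len(k) == 1)
    return (counts.get((tok,), 0) + 1) / (unigram_total + vocab_size)

toks = [1, 2, 3, 1, 2, 3, 1, 2]
c = ngram_counts(toks, max_order=4)
p = ngram_prob(c, [1, 2], 3, vocab_size=10, max_order=4)   # 2 of 3 "1 2" contexts continue with 3
```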
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
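The warmdown schedule holds the base learning rate for most of training, then decays linearly to zero over the final 3500 steps. A sketch (the 10k total step count is hypothetical; only `warmdown_steps=3500` comes from the card):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant base_lr, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

# With a hypothetical 10k-step run and the matrix_lr from the optimizer section:
lrs = [warmdown_lr(s, 10_000, 0.025) for s in (0, 6_500, 8_250, 10_000)]
```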
Regularization
LN scale
parameters: {"enabled":true}

Novel Contributions

  • Retokenization and training with the Scylla tokenizer, a ~998-token TokenMonster vocabulary
  • N-gram rescoring with orders 2-16 using a two-pass evaluation setup
  • Legal score-first TTT that computes BPB before adaptation
  • Tuned int6 quantization with clip_range=20 for stable artifact size across seeds
  • Parallel Muon optimization with cyclic shared blocks and BigramHash embeddings