PR #1242

open

Record: Scylla + n-gram + legal TTT — val_bpb 1.0903 (3-seed mean)

by Campbellb
val_bpb
1.0903
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.4 MB

Training Techniques

Architecture
BigramHash
Uses BigramHash embeddings with a 10240-token vocabulary.
parameters: {"vocab_size":10240,"dimensions":128}
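A minimal sketch of what a hashed-bigram embedding looks like, using the `vocab_size: 10240` and `dimensions: 128` parameters above. The hash mixing constants and the BOS-as-token-0 convention are illustrative assumptions, not the PR's actual implementation:

```python
import numpy as np

VOCAB_SIZE = 10240   # bucket count from the parameters above
DIM = 128            # embedding dimensions from the parameters above

def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = VOCAB_SIZE) -> int:
    """Hash a (previous, current) token pair into a fixed bucket (mixing is illustrative)."""
    h = (prev_tok * 1000003 + tok) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 2654435761) & 0xFFFFFFFF
    return h % n_buckets

rng = np.random.default_rng(0)
table = rng.normal(0, 0.02, size=(VOCAB_SIZE, DIM))   # learnable in the real model

def bigram_embed(tokens):
    """Look up one hashed-bigram vector per position (position 0 pairs with a BOS id of 0)."""
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return table[idx]

vecs = bigram_embed([5, 17, 17, 901])   # shape (4, 128)
```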
SmearGate
Smear gate is enabled in the model.
parameters: null
weight tying
Tied embeddings are used.
parameters: null
Quantization
int6
bits: 6
scope: all
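One plausible reading of the int6 setup, combined with the `clip_range=20` mentioned under Novel Contributions (the exact scaling scheme is an assumption): clip weights, compute a symmetric per-tensor scale, and round into the 6-bit signed range [-31, 31]:

```python
import numpy as np

def quantize_int6(w, clip_range=20.0):
    """Clip weights, then symmetric-quantize to 6-bit integers in [-31, 31]."""
    w = np.clip(w, -clip_range, clip_range)
    max_abs = float(np.abs(w).max())
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = np.round(w / scale).astype(np.int8)   # 6-bit value range, stored in int8
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize(q, scale)   # reconstruction error is at most scale / 2 per weight
```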
Compression
lzma
level: null
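The quantized payload is then lzma-compressed, which is one stdlib call in Python. Since the card leaves `level` unset, the preset below is an assumption:

```python
import lzma
import numpy as np

q = np.zeros(100_000, dtype=np.int8)      # stand-in for the quantized weight payload
raw = q.tobytes()
blob = lzma.compress(raw, preset=9)       # preset is an assumption; the PR leaves level unset
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```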
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
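Muon's defining step is orthogonalizing the momentum buffer with a Newton-Schulz iteration before applying it. A rough single-matrix sketch using the hyperparameters above (quintic coefficients from the public Muon reference; the "parallel"/sharded aspect and the separate scalar/tied-embedding learning rates are omitted):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)     # normalize by Frobenius norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, state, lr=0.025, momentum=0.99, weight_decay=0.04):
    """One Muon-style update: momentum buffer, orthogonalized, with decoupled weight decay."""
    state["buf"] = momentum * state["buf"] + grad
    update = newton_schulz_orthogonalize(state["buf"])
    return w * (1 - lr * weight_decay) - lr * update

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4))
O = newton_schulz_orthogonalize(G)         # singular values pushed toward 1
state = {"buf": np.zeros_like(G)}
w_new = muon_step(G, G, state)
```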
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":50}
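Both averaging schemes are simple to sketch; `decay=0.997` and `interval=50` come from the parameters above, and the rest is a minimal illustration:

```python
import numpy as np

def ema_update(avg, w, decay=0.997):
    """Exponential moving average of weights: avg <- decay * avg + (1 - decay) * w."""
    return decay * avg + (1.0 - decay) * w

class SWA:
    """Equal-weight running average of checkpoints sampled every `interval` steps."""
    def __init__(self, interval=50):
        self.interval, self.n, self.avg = interval, 0, None

    def maybe_update(self, step, w):
        if step % self.interval != 0:
            return
        self.n += 1
        # incremental mean: avg <- avg + (w - avg) / n
        self.avg = w.copy() if self.avg is None else self.avg + (w - self.avg) / self.n

w = np.ones(4)
ema = np.zeros(4)
for _ in range(3):
    ema = ema_update(ema, w)   # pulls slowly toward w with decay 0.997

swa = SWA(interval=50)
for step in range(101):
    swa.maybe_update(step, np.full(4, float(step)))   # averages checkpoints at steps 0, 50, 100
```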
Initialization
OrthoInit
Orthogonal initialization is enabled.
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"chunk_tokens":32768}
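"Score-first" is what makes this TTT legal: each chunk is scored with the current weights before the model adapts on it, so the reported BPB never comes from weights that already saw the scored text. A schematic loop with the `learning_rate=0.005` and `epochs=3` parameters above, using a hypothetical `ToyModel` in place of the real network (the 32768-token chunking is assumed to happen upstream):

```python
import math

class ToyModel:
    """Hypothetical stand-in for the real network: its per-byte loss shrinks as it adapts."""
    def __init__(self):
        self.nats_per_byte = 1.0

    def score(self, chunk):
        return self.nats_per_byte * len(chunk), len(chunk)   # (total nats, byte count)

    def train_step(self, chunk, lr):
        self.nats_per_byte *= (1.0 - lr)                     # stand-in for a gradient step

def score_first_ttt(chunks, model, lr=0.005, epochs=3):
    """Score each chunk with the CURRENT weights first, then adapt on that chunk."""
    total_bits, total_bytes = 0.0, 0
    for chunk in chunks:
        nats, n_bytes = model.score(chunk)    # evaluate first ...
        total_bits += nats / math.log(2)
        total_bytes += n_bytes
        for _ in range(epochs):               # ... then adapt on the same chunk
            model.train_step(chunk, lr=lr)
    return total_bits / total_bytes           # bits per byte

bpb = score_first_ttt([b"a" * 100, b"b" * 100], ToyModel())
```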
Evaluation
two-pass eval
parameters: {"ngram_max_order":16}
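A minimal stand-in for the two-pass setup: pass one counts n-grams up to `ngram_max_order`, pass two queries them by backing off from the longest matching context. The plain back-off and add-one smoothing here are illustrative; how the PR interpolates these probabilities with the model's predictions is not shown:

```python
from collections import defaultdict

def ngram_counts(tokens, max_order=16):
    """Pass 1: collect n-gram counts for every order up to max_order."""
    counts = defaultdict(int)
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def ngram_prob(counts, context, tok, vocab_size, max_order=16):
    """Pass 2: back off from the longest matching context; add-one smoothing at order 1."""
    for n in range(max_order, 1, -1):
        ctx = tuple(context[-(n - 1):])
        if len(ctx) == n - 1 and counts.get(ctx, 0) > 0:
            joint = counts.get(ctx + (tok,), 0)
            if joint > 0:
                return joint / counts[ctx]
    unigram_total = sum(v for k, v in counts.items() if len(k) == 1)
    return (counts.get((tok,), 0) + 1) / (unigram_total + vocab_size)

toks = [1, 2, 3, 1, 2, 3, 1, 2]
c = ngram_counts(toks, max_order=4)
p = ngram_prob(c, [1, 2], 3, vocab_size=10, max_order=4)   # 2 of 3 "1 2" contexts continue with 3
```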
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
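The warmdown schedule holds the base learning rate for most of training, then decays linearly to zero over the final 3500 steps. A sketch (the 10k total step count is hypothetical; only `warmdown_steps=3500` comes from the card):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant base_lr, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

# With a hypothetical 10k-step run and the matrix_lr from the optimizer section:
lrs = [warmdown_lr(s, 10_000, 0.025) for s in (0, 6_500, 8_250, 10_000)]
```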
Regularization
LN scale
parameters: {"enabled":true}

Novel Contributions

  • Retokenization and training with the Scylla tokenizer, a ~998-token TokenMonster vocabulary
  • N-gram rescoring with orders 2-16 using a two-pass evaluation setup
  • Legal score-first TTT that computes BPB before adaptation
  • Tuned int6 quantization with clip_range=20 for stable artifact size across seeds
  • Parallel Muon optimization with cyclic shared blocks and BigramHash embeddings