PR #1274

closed

Record: Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0876 (3-seed mean)

by MatoTeziTanka
val_bpb: 1.0876
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.83 MB

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU(0.5)^2 activation in the transformer MLPs.
parameters: {"negative_slope":0.5}
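The activation above is fully determined by its parameters: a LeakyReLU with negative_slope 0.5, followed by squaring. A minimal stand-alone version:

```python
def squared_leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU: identity for x >= 0, negative_slope * x for x < 0.
    y = x if x >= 0.0 else negative_slope * x
    # Squaring yields a smooth, non-negative activation (ReLU^2-style).
    return y * y
```

Note the square makes the activation symmetric up to the slope: negative inputs are attenuated by 0.5 before squaring rather than zeroed as in ReLU^2.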
XSA
Applies XSA only to the last 4 layers.
parameters: {"layers":4}
SmearGate
Uses gated previous-token blending.
parameters: null
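The record only says "gated previous-token blending", so the exact form of SmearGate is not specified. One common sketch is to add a gated copy of the previous token's embedding, with the gate g = sigmoid(gate_logit) learned (the additive form and scalar gate here are assumptions):

```python
import math

def smear_gate(tokens: list[list[float]], gate_logit: float) -> list[list[float]]:
    """Blend each embedding with the previous token's embedding:
    out_t = x_t + g * x_{t-1}, with g = sigmoid(gate_logit) (hypothetical form)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out, prev = [], [0.0] * len(tokens[0])  # nothing to smear into position 0
    for x in tokens:
        out.append([xi + g * pi for xi, pi in zip(x, prev)])
        prev = x
    return out
```

In a real model the gate would typically be a learned parameter (possibly per-channel) trained jointly with the rest of the network.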
BigramHash
Adds a bigram hash embedding module.
parameters: {"vocab_size":2048,"dimension":128}
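A bigram hash embedding maps each (previous, current) token pair into a fixed number of buckets (2048 here), each indexing a learned 128-dim embedding that is added to the token embedding. The record does not give the hash function; the multiplicative mixing constants below are an illustrative assumption:

```python
def bigram_bucket(prev_tok: int, cur_tok: int, vocab_size: int = 2048) -> int:
    """Hash a token bigram into one of `vocab_size` embedding buckets.
    The two mixing constants are hypothetical (Fibonacci/Murmur-style)."""
    h = (prev_tok * 0x9E3779B1 + cur_tok * 0x85EBCA6B) & 0xFFFFFFFF
    return h % vocab_size
```

Distinct multipliers for the two positions keep the hash order-sensitive, so (a, b) and (b, a) usually land in different buckets.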
depth recurrence
Repeats layers 4 and 5 once each with untied MLPs.
parameters: {"layers":[4,5]}
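Depth recurrence here means layers 4 and 5 are each executed twice, with the second pass using its own untied MLP. Whether the repeats are interleaved (4, 4, 5, 5) or blockwise (4, 5, 4, 5) is not stated; the sketch below assumes the interleaved order and tags each pass so untied weights can be selected:

```python
def layer_schedule(n_layers: int, repeated=(4, 5)) -> list[tuple[int, int]]:
    """Execution order as (layer, pass) pairs; pass 1 reuses the layer's
    attention but would index a second, untied MLP per the record."""
    order = []
    for i in range(n_layers):
        order.append((i, 0))
        if i in repeated:
            order.append((i, 1))  # immediate second pass (assumed ordering)
    return order
```

The forward loop would then dispatch on the pass index when picking the MLP weights, while attention weights stay shared across both passes.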
parallel residuals
Splits residual stream into parallel lanes starting from layer 7 with learned routing scalars.
parameters: {"start_layer":7}
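From layer 7 onward the residual stream is split into parallel lanes, with learned scalars routing each layer's output into the lanes and a learnable merge at the end (per the contributions list). The two-lane toy below illustrates the mechanics; lane count and the exact routing form are assumptions:

```python
def parallel_residual_step(lanes, layer_out, route_scalars):
    """Add a layer's output into each lane, scaled by that lane's learned
    routing scalar (the record mentions 4 scalars; values here are toy)."""
    return [[li + r * oi for li, oi in zip(lane, layer_out)]
            for lane, r in zip(lanes, route_scalars)]

def merge_lanes(lanes, merge_weights):
    """Learnable lane merge: weighted sum of lanes back into one stream."""
    dim = len(lanes[0])
    return [sum(w * lane[d] for w, lane in zip(merge_weights, lanes))
            for d in range(dim)]
```

Routing scalars let the model specialize lanes (e.g. one lane carrying mostly early-layer features), and the merge weights decide how much each lane contributes to the final hidden state.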
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3}
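Score-first TTT keeps the evaluation "legal": each chunk is scored with weights that have never seen it, and only afterwards does the model adapt on that chunk to help with later ones. A toy version with a one-parameter next-value predictor (the model and loss are illustrative; only lr=0.005 and epochs=3 come from the record):

```python
def score_first_ttt(chunks, w: float = 0.0, lr: float = 0.005, epochs: int = 3):
    """Toy model: predict x[t+1] = w * x[t]; squared-error loss."""
    losses = []
    for chunk in chunks:
        pairs = list(zip(chunk, chunk[1:]))
        # 1) Score first: w has not been updated on this chunk yet.
        losses.append(sum((w * x - y) ** 2 for x, y in pairs) / len(pairs))
        # 2) Then adapt on the already-scored chunk for later chunks.
        for _ in range(epochs):
            grad = sum(2.0 * (w * x - y) * x for x, y in pairs) / len(pairs)
            w -= lr * grad
    return losses, w
```

The key invariant is the ordering inside the loop: the loss for a chunk is recorded before any gradient step touches it, so reported bpb never benefits from training on the evaluated tokens.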
Quantization
mixed int5/int6
bits: null
scope: block weights
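Per-row quantization (per the contributions list) scales each weight row independently to signed 5- or 6-bit integers. The symmetric scheme below is a sketch; the record does not say how rows are assigned to 5 vs. 6 bits:

```python
def quantize_row(row: list[float], bits: int):
    """Symmetric per-row quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    amax = max(abs(v) for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) for v in row], scale

def dequantize_row(q: list[int], scale: float) -> list[float]:
    return [qi * scale for qi in q]
```

Each row stores its integers plus one float scale; the mixed int5/int6 choice trades artifact size against per-row reconstruction error before the brotli pass.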
Compression
brotli
level: 11
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
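The optimizer warms momentum up from 0.92 to the final 0.99 over 1500 steps. The record gives only the endpoints and step count; a linear ramp is assumed:

```python
def muon_momentum(step: int, start: float = 0.92, final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Momentum warmup: linear ramp (assumed) from `start` to `final`,
    then held constant for the rest of training."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (final - start)
```

Starting with lower momentum keeps early updates responsive while gradients are still noisy, then the high final momentum smooths the long tail of training.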
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
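Combining EMA and SWA means maintaining two running averages of the weights: an exponential average updated every step with decay 0.997, and an equal-weight average snapshotted every 50 steps. How the two are finally combined is not specified; this sketch just maintains both:

```python
def update_averages(step, weights, ema, swa, swa_count,
                    ema_decay=0.997, swa_every=50):
    """One update of EMA (every step) and SWA (every `swa_every` steps).
    Returns the new (ema, swa, swa_count) state."""
    ema = [ema_decay * e + (1.0 - ema_decay) * w for e, w in zip(ema, weights)]
    if step % swa_every == 0:
        swa_count += 1
        # Running equal-weight mean of the snapshots taken so far.
        swa = [s + (w - s) / swa_count for s, w in zip(swa, weights)]
    return ema, swa, swa_count
```

EMA tracks recent weights closely while SWA averages over a wider window; evaluating both (or a blend) and keeping the better checkpoint is a common pattern.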
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
layerwise LN scale
parameters: {"ln_scale":1}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
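The warmdown schedule holds the learning rate constant and then decays it over the final 3500 steps. The record gives only warmdown_steps; the constant-then-linear-to-zero shape below is the usual speedrun convention and is assumed here:

```python
def lr_with_warmdown(step: int, total_steps: int, base_lr: float,
                     warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay ('warmdown') to zero over the
    final `warmdown_steps` steps (shape assumed)."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```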

Novel Contributions

  • Scylla tokenizer integration with a 998-token TokenMonster vocabulary
  • Parallel residual routing starting from layer 7 with learned 4-scalar routing
  • Mini depth recurrence on layers 4 and 5 with untied MLPs
  • Legal score-first TTT: test-time updates are applied only after each chunk has been scored under inference mode
  • Mixed INT5/INT6 per-row quantization with brotli-11 compression
  • Learnable lane merge for parallel residuals