PR #1274

closed

Record: Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0876 (3-seed mean)

by MatoTeziTanka
val_bpb: 1.0876
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.83 MB

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU(0.5)^2 activation in the transformer MLPs.
parameters: {"negative_slope":0.5}
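The activation above is fully determined by its parameters: a LeakyReLU with negative_slope 0.5, followed by squaring. A minimal stand-alone version:

```python
def squared_leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU: identity for x >= 0, negative_slope * x for x < 0.
    y = x if x >= 0.0 else negative_slope * x
    # Squaring yields a smooth, non-negative activation (ReLU^2-style).
    return y * y
```

Note the square makes the activation symmetric up to the slope: negative inputs are attenuated by 0.5 before squaring rather than zeroed as in ReLU^2.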
XSA
Applies XSA only to the last 4 layers.
parameters: {"layers":4}
SmearGate
Uses gated previous-token blending.
parameters: null
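The record only says "gated previous-token blending", so the exact form of SmearGate is not specified. One common sketch is to add a gated copy of the previous token's embedding, with the gate g = sigmoid(gate_logit) learned (the additive form and scalar gate here are assumptions):

```python
import math

def smear_gate(tokens: list[list[float]], gate_logit: float) -> list[list[float]]:
    """Blend each embedding with the previous token's embedding:
    out_t = x_t + g * x_{t-1}, with g = sigmoid(gate_logit) (hypothetical form)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out, prev = [], [0.0] * len(tokens[0])  # nothing to smear into position 0
    for x in tokens:
        out.append([xi + g * pi for xi, pi in zip(x, prev)])
        prev = x
    return out
```

In a real model the gate would typically be a learned parameter (possibly per-channel) trained jointly with the rest of the network.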
BigramHash
Adds a bigram hash embedding module.
parameters: {"vocab_size":2048,"dimension":128}
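A bigram hash embedding maps each (previous, current) token pair into a fixed number of buckets (2048 here), each indexing a learned 128-dim embedding that is added to the token embedding. The record does not give the hash function; the multiplicative mixing constants below are an illustrative assumption:

```python
def bigram_bucket(prev_tok: int, cur_tok: int, vocab_size: int = 2048) -> int:
    """Hash a token bigram into one of `vocab_size` embedding buckets.
    The two mixing constants are hypothetical (Fibonacci/Murmur-style)."""
    h = (prev_tok * 0x9E3779B1 + cur_tok * 0x85EBCA6B) & 0xFFFFFFFF
    return h % vocab_size
```

Distinct multipliers for the two positions keep the hash order-sensitive, so (a, b) and (b, a) usually land in different buckets.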
depth recurrence
Repeats layers 4 and 5 once each with untied MLPs.
parameters: {"layers":[4,5]}
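Depth recurrence here means layers 4 and 5 are each executed twice, with the second pass using its own untied MLP. Whether the repeats are interleaved (4, 4, 5, 5) or blockwise (4, 5, 4, 5) is not stated; the sketch below assumes the interleaved order and tags each pass so untied weights can be selected:

```python
def layer_schedule(n_layers: int, repeated=(4, 5)) -> list[tuple[int, int]]:
    """Execution order as (layer, pass) pairs; pass 1 reuses the layer's
    attention but would index a second, untied MLP per the record."""
    order = []
    for i in range(n_layers):
        order.append((i, 0))
        if i in repeated:
            order.append((i, 1))  # immediate second pass (assumed ordering)
    return order
```

The forward loop would then dispatch on the pass index when picking the MLP weights, while attention weights stay shared across both passes.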
parallel residuals
Splits residual stream into parallel lanes starting from layer 7 with learned routing scalars.
parameters: {"start_layer":7}
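From layer 7 onward the residual stream is split into parallel lanes, with learned scalars routing each layer's output into the lanes and a learnable merge at the end (per the contributions list). The two-lane toy below illustrates the mechanics; lane count and the exact routing form are assumptions:

```python
def parallel_residual_step(lanes, layer_out, route_scalars):
    """Add a layer's output into each lane, scaled by that lane's learned
    routing scalar (the record mentions 4 scalars; values here are toy)."""
    return [[li + r * oi for li, oi in zip(lane, layer_out)]
            for lane, r in zip(lanes, route_scalars)]

def merge_lanes(lanes, merge_weights):
    """Learnable lane merge: weighted sum of lanes back into one stream."""
    dim = len(lanes[0])
    return [sum(w * lane[d] for w, lane in zip(merge_weights, lanes))
            for d in range(dim)]
```

Routing scalars let the model specialize lanes (e.g. one lane carrying mostly early-layer features), and the merge weights decide how much each lane contributes to the final hidden state.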
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3}
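Score-first TTT keeps the evaluation "legal": each chunk is scored with weights that have never seen it, and only afterwards does the model adapt on that chunk to help with later ones. A toy version with a one-parameter next-value predictor (the model and loss are illustrative; only lr=0.005 and epochs=3 come from the record):

```python
def score_first_ttt(chunks, w: float = 0.0, lr: float = 0.005, epochs: int = 3):
    """Toy model: predict x[t+1] = w * x[t]; squared-error loss."""
    losses = []
    for chunk in chunks:
        pairs = list(zip(chunk, chunk[1:]))
        # 1) Score first: w has not been updated on this chunk yet.
        losses.append(sum((w * x - y) ** 2 for x, y in pairs) / len(pairs))
        # 2) Then adapt on the already-scored chunk for later chunks.
        for _ in range(epochs):
            grad = sum(2.0 * (w * x - y) * x for x, y in pairs) / len(pairs)
            w -= lr * grad
    return losses, w
```

The key invariant is the ordering inside the loop: the loss for a chunk is recorded before any gradient step touches it, so reported bpb never benefits from training on the evaluated tokens.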
Quantization
mixed int5/int6
bits: null
scope: block weights
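Per-row quantization (per the contributions list) scales each weight row independently to signed 5- or 6-bit integers. The symmetric scheme below is a sketch; the record does not say how rows are assigned to 5 vs. 6 bits:

```python
def quantize_row(row: list[float], bits: int):
    """Symmetric per-row quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    amax = max(abs(v) for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) for v in row], scale

def dequantize_row(q: list[int], scale: float) -> list[float]:
    return [qi * scale for qi in q]
```

Each row stores its integers plus one float scale; the mixed int5/int6 choice trades artifact size against per-row reconstruction error before the brotli pass.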
Compression
brotli
level: 11
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
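The optimizer warms momentum up from 0.92 to the final 0.99 over 1500 steps. The record gives only the endpoints and step count; a linear ramp is assumed:

```python
def muon_momentum(step: int, start: float = 0.92, final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Momentum warmup: linear ramp (assumed) from `start` to `final`,
    then held constant for the rest of training."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (final - start)
```

Starting with lower momentum keeps early updates responsive while gradients are still noisy, then the high final momentum smooths the long tail of training.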
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
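Combining EMA and SWA means maintaining two running averages of the weights: an exponential average updated every step with decay 0.997, and an equal-weight average snapshotted every 50 steps. How the two are finally combined is not specified; this sketch just maintains both:

```python
def update_averages(step, weights, ema, swa, swa_count,
                    ema_decay=0.997, swa_every=50):
    """One update of EMA (every step) and SWA (every `swa_every` steps).
    Returns the new (ema, swa, swa_count) state."""
    ema = [ema_decay * e + (1.0 - ema_decay) * w for e, w in zip(ema, weights)]
    if step % swa_every == 0:
        swa_count += 1
        # Running equal-weight mean of the snapshots taken so far.
        swa = [s + (w - s) / swa_count for s, w in zip(swa, weights)]
    return ema, swa, swa_count
```

EMA tracks recent weights closely while SWA averages over a wider window; evaluating both (or a blend) and keeping the better checkpoint is a common pattern.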
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
layerwise LN scale
parameters: {"ln_scale":1}
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
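The warmdown schedule holds the learning rate constant and then decays it over the final 3500 steps. The record gives only warmdown_steps; the constant-then-linear-to-zero shape below is the usual speedrun convention and is assumed here:

```python
def lr_with_warmdown(step: int, total_steps: int, base_lr: float,
                     warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay ('warmdown') to zero over the
    final `warmdown_steps` steps (shape assumed)."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```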

Novel Contributions

  • Scylla tokenizer integration with a 998-token TokenMonster vocabulary
  • Parallel residual routing starting from layer 7 with learned 4-scalar routing
  • Mini depth recurrence on layers 4 and 5 with untied MLPs
  • Legal score-first TTT: test-time updates are applied only after each chunk has been scored under inference mode
  • Mixed INT5/INT6 per-row quantization with brotli-11 compression
  • Learnable lane merge for parallel residuals