PR #1289

open

Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean)

by MatoTeziTankaView on GitHub

val_bpb

1.0819

Architecture

Transformer

Optimizer

Muon

Artifact Size

16 MB

Training Techniques

Quantization

mixed int5/int6

bits: null

scope: MLP and attention layers
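As a rough illustration of what int5 versus int6 storage means for a weight tensor, here is a minimal sketch assuming symmetric absmax scaling with one scale per tensor. The function names and the scaling scheme are assumptions for illustration; the card does not say how the PR scales or packs its weights.

```python
def quantize_symmetric(weights, bits):
    """Round weights to signed integers of the given bit width (absmax scaling)."""
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    absmax = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = absmax / qmax
    # Clamp to the signed range [-qmax - 1, qmax] after rounding.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute reconstruction error at the given bit width."""
    q, scale = quantize_symmetric(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))
```

Under this scheme the reconstruction error is at most half a quantization step, so going from int6 to int5 roughly doubles the worst-case error for the same tensor.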

QAT

bits: null

scope: model weights

Architecture

Parallel Residuals

Separate attention and MLP residual lanes with learnable mixing.

parameters: {"from_layer":7}
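One way to picture two residual lanes with learnable mixing is sketched below on plain Python lists. This is a guess at the general shape, not the PR's implementation: the read/write pattern, `resid_mix_attn`, and `merge_lanes` are assumptions, and only the name `resid_mix_mlp` appears in the card. Per the parameters above, such a block would apply from layer 7 onward.

```python
def mix(a, b, w):
    """Per-dimension blend: w[i] * a[i] + (1 - w[i]) * b[i]."""
    return [wi * ai + (1.0 - wi) * bi for ai, bi, wi in zip(a, b, w)]


def parallel_block(attn_lane, mlp_lane, attn_fn, mlp_fn,
                   resid_mix_attn, resid_mix_mlp):
    """One block with separate attention and MLP residual lanes (assumed layout)."""
    # Each sublayer reads a learnable blend of both lanes...
    attn_in = mix(attn_lane, mlp_lane, resid_mix_attn)
    mlp_in = mix(mlp_lane, attn_lane, resid_mix_mlp)
    # ...and adds its output back to its own lane only.
    new_attn = [a + d for a, d in zip(attn_lane, attn_fn(attn_in))]
    new_mlp = [m + d for m, d in zip(mlp_lane, mlp_fn(mlp_in))]
    return new_attn, new_mlp


def merge_lanes(attn_lane, mlp_lane, merge_w):
    """Collapse the two lanes into one stream with a learnable weight."""
    return mix(attn_lane, mlp_lane, merge_w)
```

Here `attn_fn` and `mlp_fn` stand in for the real sublayers, and the mixing weights would be trained parameters rather than fixed lists.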

Depth Recurrence

Mini depth recurrence applied to middle layers during training.

parameters: {"layers":[4,5],"start_step":3000}
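The parameters above suggest a schedule in which layers 4 and 5 are run a second time once training reaches step 3000. A minimal sketch of such a schedule follows, assuming a single extra pass placed right after the first one and an inclusive start step. The function name, the 11-layer count in the usage note, and the placement of the repeat are assumptions, not details taken from the PR.

```python
def layer_schedule(num_layers, step, recur_layers=(4, 5), start_step=3000):
    """Return the order in which layer indices are executed at a given step."""
    order = list(range(num_layers))
    if step < start_step:
        return order                      # plain forward pass before the switch
    # Re-run the recurrent layers once, right after their first pass.
    idx = order.index(max(recur_layers)) + 1
    return order[:idx] + list(recur_layers) + order[idx:]
```

For a hypothetical 11-layer model, `layer_schedule(11, 3000)` visits layers 4 and 5 twice, which adds effective depth without adding parameters.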

XSA

Exclusive Self-Attention used in the last layers.

parameters: {"last_layers":4}

BigramHash

Hash-based bigram embeddings.

parameters: null
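Hash-based bigram embeddings generally map each (previous token, current token) pair to a row of a fixed-size table, so the table does not need one row per possible pair. The sketch below shows only the index computation; the multiplier, the bucket count, and the start-of-sequence handling are assumptions, since the card gives no parameters.

```python
def bigram_bucket(prev_id, cur_id, num_buckets):
    """Hash a token pair into a bucket index (hypothetical hash function)."""
    return (prev_id * 1000003 + cur_id) % num_buckets


def bigram_ids(tokens, num_buckets, bos_id=0):
    """Bucket index for every position; the first token is paired with bos_id."""
    prev = [bos_id] + list(tokens[:-1])
    return [bigram_bucket(p, c, num_buckets) for p, c in zip(prev, tokens)]
```

Each returned index would then select a row of a learned embedding table whose vector is added to the ordinary token embedding. Distinct pairs can collide in the same bucket, which is the price of the fixed table size.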

SmearGate

Gating mechanism used with hashed bigram features.

parameters: null

VE128

Value embeddings on deeper layers.

parameters: {"layers":[9,10],"dimensions":128}

Partial RoPE

Rotary position embeddings applied partially.

parameters: {"dimensions":16}
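Partial RoPE rotates only some of the dimensions of each query/key head and passes the rest through unchanged. Below is a minimal sketch for a single head vector, assuming the 16 rotated dimensions are the first 16 and that dimension i is paired with dimension i + 8. The pairing convention and the base of 10000 are assumptions; the card only states the dimension count.

```python
import math


def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate the first rope_dims entries of vec by position-dependent angles."""
    out = list(vec)                       # dimensions past rope_dims stay as-is
    half = rope_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair is rotated, the vector norm is preserved, and the untouched dimensions carry position-independent content.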

LeakyReLU

Squared LeakyReLU activation in the MLP.

parameters: {"squared":true,"negative_slope":0.5}
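With the parameters above, the activation can be sketched as a LeakyReLU with slope 0.5 followed by squaring. This assumes the square is applied directly to the LeakyReLU output, so negative inputs produce small positive values; the function name is illustrative.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Square of LeakyReLU(x): x**2 for x >= 0, (negative_slope * x)**2 otherwise."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

With a slope of 0.5, a negative input contributes a quarter of what the same-magnitude positive input would, so the negative side still carries gradient instead of being zeroed out.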

Weight Averaging

EMA

parameters: {"decay":0.997}
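An exponential moving average of the weights with decay 0.997 can be sketched as follows. The flat-list representation and the function name are illustrative assumptions; a real run would apply the same update to every parameter tensor after each optimizer step.

```python
def ema_update(ema, params, decay=0.997):
    """Blend current parameters into the running average."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

At this decay each step moves the average 0.3% of the way toward the current weights, so the average effectively spans the last few hundred steps.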

Compression

lzma

level: null
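Compressing a serialized artifact with LZMA needs only Python's standard `lzma` module. The sketch below assumes the artifact is already available as raw bytes and uses preset 9 purely as an example, since the card lists the level as null.

```python
import lzma


def compress_artifact(raw: bytes, preset: int = 9) -> bytes:
    """Compress serialized weights with LZMA (preset chosen for illustration)."""
    return lzma.compress(raw, preset=preset)


def decompress_artifact(blob: bytes) -> bytes:
    """Recover the original bytes from a compressed artifact."""
    return lzma.decompress(blob)
```

Checking `len(compress_artifact(raw))` against the size limit before submission is the obvious use.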

Test-Time Training

score-first TTT

parameters: null
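The ordering behind score-first test-time training is that each chunk of validation data is scored with the model as it stands and only afterwards used for adaptation, so no token is evaluated by a model that has already seen it. The sketch below shows that ordering with a toy adaptive unigram-count model swapped in for the real network and its gradient steps. The model and the function name are stand-ins for illustration, not the PR's method.

```python
import math


def score_first_ttt(chunks, vocab_size):
    """Average bits per token when every chunk is scored before being learned from."""
    counts = [1] * vocab_size             # Laplace-smoothed unigram counts
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total = sum(counts)
        # 1) Score the chunk with the model as it is right now.
        for tok in chunk:
            total_bits += -math.log2(counts[tok] / total)
        total_tokens += len(chunk)
        # 2) Only then adapt on the chunk that was just scored.
        for tok in chunk:
            counts[tok] += 1
    return total_bits / total_tokens
```

Swapping steps 1 and 2 would lower the reported score but leak the evaluated tokens into the model, which is the situation the score-first ordering rules out.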

Regularization

LN scale

parameters: null

Novel Contributions

Sensitivity-driven mixed INT5/INT6 quantization with per-layer routing
Learnable lane merge and per-dimension resid_mix_mlp for parallel residual streams
Scylla retokenization pipeline for converting SP1024 FineWeb shards to the Scylla vocabulary
Integration of parallel residuals, depth recurrence, legal TTT, and Scylla tokenizer into one training run
CPU end-to-end test suite for pre-flight validation
Controlled LeakyReLU slope sweep and follow-up A/B testing on the parallel residual architecture
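For the first contribution in the list above, per-layer routing can be pictured as ranking layers by a sensitivity score and keeping only the most sensitive ones at the higher bit width. The sketch below assumes the scores are already computed and that a fixed number of layers may stay at int6. The scoring method, the budget rule, and all names are assumptions, as the card does not describe them.

```python
def route_bits(sensitivity, num_int6):
    """Assign 6 bits to the num_int6 most sensitive layers and 5 bits to the rest."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    keep_int6 = set(ranked[:num_int6])
    return {name: (6 if name in keep_int6 else 5) for name in sensitivity}
```

For example, `route_bits({"a": 0.1, "b": 0.9, "c": 0.5}, 1)` keeps only layer `"b"` at int6.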