PR #1289
open
Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean)
by MatoTeziTankaView on GitHub
val_bpb
1.0819
Architecture
Transformer
Optimizer
Muon
Artifact Size
16 MB
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: MLP and attention layers
QAT
bits: null
scope: model weights
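A minimal sketch of the mixed int5/int6 idea, assuming symmetric quantization with one scale per tensor and a simple threshold rule for routing. The function names, the threshold, and the sensitivity scores are hypothetical; the PR does not state its actual scale granularity or routing criterion.

```python
def quantize_symmetric(weights, bits):
    """Quantize floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


def route_bits(sensitivity, threshold=0.5):
    """Hypothetical routing rule: more sensitive layers keep the wider format."""
    return 6 if sensitivity >= threshold else 5


# Made-up sensitivity scores and weights, for illustration only.
layer_sensitivity = {"mlp.0": 0.2, "attn.0": 0.9}
weights = [0.8, -0.31, 0.05, -1.0]

for name, score in layer_sensitivity.items():
    bits = route_bits(score)
    q, scale = quantize_symmetric(weights, bits)
    worst = max(abs(a - b) for a, b in zip(dequantize(q, scale), weights))
    print(f"{name}: int{bits}, max abs error {worst:.4f}")
```

The int6 path halves the step size relative to int5 for the same tensor, which is the trade a per-layer router makes against artifact size.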
Architecture
Parallel Residuals
Separate attention and MLP residual lanes with learnable mixing.
parameters: {"from_layer":7}
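One way to picture two residual lanes with learnable mixing is sketched below on plain Python lists. The exact wiring is an assumption: here the MLP reads a per-dimension blend of both lanes (the `resid_mix_mlp` argument borrows its name from the contribution list), each sublayer writes back to its own lane, and a scalar `merge_weight` combines the lanes at the end. The PR may wire this differently.

```python
def parallel_residual_block(attn_lane, mlp_lane, attn_fn, mlp_fn, resid_mix_mlp):
    """Run one block that keeps separate residual lanes for attention and MLP."""
    # The MLP input is a per-dimension blend of the two lanes (assumed form).
    mlp_in = [m * a + (1.0 - m) * b
              for m, a, b in zip(resid_mix_mlp, attn_lane, mlp_lane)]
    attn_out = attn_fn(attn_lane)
    mlp_out = mlp_fn(mlp_in)
    # Each sublayer adds its output to its own lane only.
    new_attn = [x + y for x, y in zip(attn_lane, attn_out)]
    new_mlp = [x + y for x, y in zip(mlp_lane, mlp_out)]
    return new_attn, new_mlp


def merge_lanes(attn_lane, mlp_lane, merge_weight):
    """Collapse both lanes into one stream with a learnable scalar weight."""
    return [merge_weight * a + (1.0 - merge_weight) * b
            for a, b in zip(attn_lane, mlp_lane)]


# Toy sublayers standing in for real attention and MLP modules.
attn_fn = lambda v: [0.1 * x for x in v]
mlp_fn = lambda v: [0.2 * x for x in v]

a, m = parallel_residual_block([1.0, 2.0], [1.0, 2.0], attn_fn, mlp_fn, [0.5, 0.5])
print(merge_lanes(a, m, merge_weight=0.5))
```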
Depth Recurrence
Mini depth recurrence applied to middle layers during training.
parameters: {"layers":[4,5],"start_step":3000}
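The parameters above suggest that layers 4 and 5 are revisited once training passes step 3000. A rough sketch of such a schedule follows, assuming the recurrent layers form a contiguous block that is repeated exactly once and assuming an 11-layer stack; neither the repeat count nor the layer count is stated in the PR.

```python
def layer_schedule(num_layers, recur_layers, step, start_step):
    """Return the order in which layer indices are applied at a training step."""
    order = []
    for i in range(num_layers):
        order.append(i)
        # After the last recurrent layer, replay the whole recurrent block once.
        if step >= start_step and i == recur_layers[-1]:
            order.extend(recur_layers)
    return order


# Hypothetical 11-layer model; layers and start_step taken from the card.
print(layer_schedule(11, [4, 5], step=100, start_step=3000))
print(layer_schedule(11, [4, 5], step=3000, start_step=3000))
```

Because the replayed layers share weights with their first pass, the effective depth grows without adding parameters to the artifact.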
XSA
Exclusive Self-Attention used in the last layers.
parameters: {"last_layers":4}
BigramHash
Hash-based bigram embeddings.
parameters: null
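Hash-based bigram embeddings typically map each (previous token, current token) pair to a row in a fixed-size table. The sketch below shows only the index computation; the multiplier, the table size, and the start-of-sequence token are illustrative assumptions, not values from the PR.

```python
def bigram_bucket(prev_token, token, num_buckets, multiplier=1000003):
    """Hash a token pair into one of num_buckets embedding rows."""
    return (prev_token * multiplier + token) % num_buckets


def bigram_ids(tokens, num_buckets, bos_token=0):
    """Return one bucket index per position, pairing each token with its predecessor."""
    prev = bos_token
    ids = []
    for t in tokens:
        ids.append(bigram_bucket(prev, t, num_buckets))
        prev = t
    return ids


# The pair (5, 7) occurs twice and lands in the same bucket both times.
print(bigram_ids([5, 7, 5, 7], num_buckets=4096))
```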
SmearGate
Gating mechanism used with hashed bigram features.
parameters: null
VE128
Value embeddings on deeper layers.
parameters: {"layers":[9,10],"dimensions":128}
Partial RoPE
Rotary position embeddings applied partially.
parameters: {"dimensions":16}
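As an illustration of partial RoPE, the sketch below rotates only the first 16 dimensions of a single head vector and passes the rest through unchanged. The head size of 32, the base of 10000, and the interleaved pairing of dimensions are assumptions; the card only gives the rotary dimension count.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Apply rotary position embedding to the first rotary_dims entries of vec."""
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # Pair (i, i+1) rotates by an angle that shrinks with the pair index.
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


head = [float(i) for i in range(32)]   # hypothetical 32-dim head vector
rotated = partial_rope(head, position=5)
print(rotated[16:] == head[16:])       # the untouched tail is passed through as-is
```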
LeakyReLU
Squared LeakyReLU activation in the MLP.
parameters: {"squared":true,"negative_slope":0.5}
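Read literally, the entry above describes a LeakyReLU with negative slope 0.5 whose output is then squared. A scalar sketch under that reading (the function name is illustrative):

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU, then square the result."""
    y = x if x >= 0 else negative_slope * x
    return y * y


for x in (-2.0, 0.0, 3.0):
    print(x, leaky_relu_squared(x))
```

Unlike a squared ReLU, negative inputs still produce a nonzero output and gradient, scaled by the square of the slope.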
Weight Averaging
EMA
parameters: {"decay":0.997}
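The standard EMA update with the listed decay can be sketched as follows. Weights are shown as flat lists for simplicity; how often the PR applies the update, and to which tensors, is not specified here.

```python
def ema_update(ema_weights, weights, decay=0.997):
    """Blend the running average toward the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [0.0, 0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0, -1.0])   # constant toy weights
print(ema)                               # drifts toward [1.0, -1.0]
```

With a decay of 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997), or about 333 updates.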
Compression
lzma
level: null
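A small sketch of compressing quantized weights with Python's standard `lzma` module and checking the result against a size budget. The one-byte-per-value packing, the default preset, and the reading of 16 MB as 16,000,000 bytes are all assumptions; the card leaves the compression level unspecified.

```python
import lzma

SIZE_LIMIT_BYTES = 16_000_000   # assumed interpretation of the 16 MB budget


def pack_artifact(quantized_values):
    """Store each signed value in one byte, then compress with LZMA."""
    raw = bytes(v & 0xFF for v in quantized_values)
    return lzma.compress(raw)


def unpack_artifact(blob):
    """Decompress and restore signed values from their byte encoding."""
    raw = lzma.decompress(blob)
    return [b - 256 if b > 127 else b for b in raw]


# Synthetic values in the int5 range [-16, 15], for illustration only.
values = [-16 + (i * 7) % 32 for i in range(10_000)]
blob = pack_artifact(values)
print(len(blob), len(blob) <= SIZE_LIMIT_BYTES)
```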
Test-Time Training
score-first TTT
parameters: null
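"Score-first" is commonly understood to mean that each evaluation chunk is scored before the model adapts on it, so no token is predicted by weights that have already trained on that token. The loop below sketches that ordering with hypothetical `score_fn` and `update_fn` hooks and a toy scalar model; it is not the PR's implementation.

```python
def score_first_ttt(chunks, score_fn, update_fn, state):
    """Score each chunk with the current state, then adapt the state on it."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(state, chunk)   # state has not seen this chunk yet
        state = update_fn(state, chunk)        # only now adapt on the scored chunk
    return total_loss, state


# Toy model: the state is a scalar prediction, the loss is squared error.
score_fn = lambda state, chunk: (chunk - state) ** 2
update_fn = lambda state, chunk: chunk         # jump straight to the last value

loss, final_state = score_first_ttt([1.0, 1.0], score_fn, update_fn, state=0.0)
print(loss, final_state)
```

Swapping the two lines inside the loop would let the model train on a chunk before being scored on it, which is the leakage this ordering is meant to rule out.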
Regularization
LN scale
parameters: null
Novel Contributions
- Sensitivity-driven mixed INT5/INT6 quantization with per-layer routing
- Learnable lane merge and per-dimension resid_mix_mlp for parallel residual streams
- Scylla retokenization pipeline for converting SP1024 FineWeb shards to the Scylla vocabulary
- Integration of parallel residuals, depth recurrence, legal TTT, and Scylla tokenizer into one training run
- CPU end-to-end test suite for pre-flight validation
- Controlled LeakyReLU slope sweep and follow-up A/B testing on the parallel residual architecture