PR #1628
openSP8192 Depth Recurrence + Parallel Residuals + TTT (1.1921 BPB)
by yu314-coder
val_bpb
1.1921
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.79 MB
Training Techniques
Architecture
depth recurrence
Layers 3-5 share weights and are looped 3 times per forward pass, expanding 11 physical layers into 17 virtual layers.
parameters: {"physical_layers":11,"virtual_layers":17,"shared_layers":[3,4,5],"loops":3}
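A minimal sketch of the layer schedule this implies (function and variable names are illustrative, not from the PR): the shared block [3, 4, 5] is emitted 3 times in sequence, so 11 physical layers yield 17 virtual layers.

```python
def build_layer_schedule(physical_layers=11, shared=(3, 4, 5), loops=3):
    """Return the sequence of physical layer indices run per forward pass."""
    schedule = []
    for i in range(physical_layers):
        if i == shared[0]:
            # Emit the whole shared block `loops` times in a row.
            schedule.extend(list(shared) * loops)
        elif i in shared:
            continue  # already emitted as part of the shared block
        else:
            schedule.append(i)
    return schedule

schedule = build_layer_schedule()
```

With the PR's parameters this produces 17 entries over 11 distinct layers.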
Parallel residuals
GPT-J-style parallel residual connections, in which attention and MLP read the same input and are both summed into the residual stream, are used from layer 7 onward.
parameters: {"start_layer":7}
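A toy numeric sketch of the parallel-residual shape versus the usual sequential block; the `attn`/`mlp` stand-ins here are illustrative scalars, not the PR's sublayers.

```python
def parallel_block(x, attn, mlp):
    # GPT-J style: both sublayers read the same input and their outputs
    # are summed into the residual stream in one step.
    return x + attn(x) + mlp(x)

def sequential_block(x, attn, mlp):
    # Standard pre-norm transformer ordering, for comparison.
    x = x + attn(x)
    return x + mlp(x)

attn = lambda x: 0.5 * x   # toy sublayer
mlp = lambda x: 0.25 * x   # toy sublayer
```

The parallel form lets attention and MLP run concurrently since neither sees the other's output.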
Partial RoPE
Only a subset of head dimensions use rotary embeddings; the rest remain position-free.
parameters: {"dimensions":16,"total_dimensions":64}
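A sketch of partial RoPE on a single 64-dim head vector, rotating only the first 16 dimensions pairwise; this is the standard pairwise rotation formulation, assumed rather than taken from the PR's code.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Apply rotary embedding to the first `rot_dims` dimensions only.

    Pairs (2i, 2i+1) are rotated by angle pos / base**(2i / rot_dims);
    the remaining dimensions pass through position-free.
    """
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

v = [1.0] * 64
rotated = partial_rope(v, pos=5)
```

Rotation preserves the norm of the rotated slice, and the last 48 dimensions are untouched.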
LeakyReLU
Uses a squared LeakyReLU activation.
parameters: {"slope":0.5}
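A minimal sketch of the activation, assuming "squared" means the LeakyReLU output is squared outright (the PR does not specify whether the sign is preserved):

```python
def leaky_relu_squared(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then squared; note squaring
    # makes the output non-negative on both sides (an assumption here).
    y = x if x > 0 else slope * x
    return y * y
```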
KV head count
Uses grouped-query attention: multiple query heads share each key/value head.
parameters: {"heads":8,"kv_heads":4}
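With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A sketch of the usual head-to-group mapping (the function name is illustrative):

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """Map a query head to the key/value head it shares under
    grouped-query attention."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head here
    return query_head // group_size
```

Halving the KV heads halves the KV-cache size while keeping all 8 query projections.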
U-Net skip connections
Encoder-decoder style skip connections with sigmoid-gated skip weights.
parameters: null
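A scalar sketch of a sigmoid-gated skip connection: a learned gate logit is squashed to (0, 1) and scales the encoder activation before it joins the decoder stream. The exact placement in the PR's U-Net wiring is assumed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_skip(decoder_x, encoder_x, gate_logit):
    # The gate lets training smoothly turn each skip connection
    # up or down rather than adding encoder features at full strength.
    return decoder_x + sigmoid(gate_logit) * encoder_x
```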
Weight Averaging
EMA
parameters: {"decay":0.9965,"start_fraction":0.5}
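A simplified scalar sketch of the EMA with a delayed start: averaging only begins at the 50% mark of training, with decay 0.9965 (real weights are tensors; this uses scalars for clarity).

```python
def ema_weights(checkpoints, decay=0.9965, start_fraction=0.5):
    """Exponential moving average over the second half of training."""
    start = int(len(checkpoints) * start_fraction)
    ema = checkpoints[start]
    for w in checkpoints[start + 1:]:
        ema = decay * ema + (1.0 - decay) * w
    return ema
```

Skipping the first half keeps noisy early-training weights out of the average.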
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"scope":"matrices"}
Adam
weight_decay: null
momentum: null
other_params: {"scope":"embeddings/scalars"}
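A sketch of the usual parameter routing for Muon-style setups, assumed to match this PR's "matrices vs. embeddings/scalars" split: 2-D+ non-embedding weights go to Muon, everything else to Adam. Names are illustrative.

```python
def split_params(named_shapes):
    """Partition parameters between Muon (matrices) and Adam
    (embeddings and scalar/vector parameters)."""
    muon, adam = [], []
    for name, shape in named_shapes:
        if len(shape) >= 2 and "embed" not in name:
            muon.append(name)
        else:
            adam.append(name)
    return muon, adam

params = [("embed.weight", (50304, 768)),
          ("attn.w_qkv", (768, 2304)),
          ("ln.gain", (768,))]
muon, adam = split_params(params)
```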
Compression
zlib
level: 9
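Maximum-compression zlib is a one-liner with the stdlib; a sketch of how the artifact bytes would be packed to stay under the 16 MB limit (the wrapper name is illustrative):

```python
import zlib

def compress_artifact(raw: bytes) -> bytes:
    # Level 9 trades compression time for the smallest output.
    return zlib.compress(raw, level=9)

blob = compress_artifact(b"weights " * 1000)
```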
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"epochs":3,"momentum":0.9}
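A hypothetical sketch of "score-first" TTT on a toy one-parameter model: each chunk is scored with the current weights *before* the model adapts on it, so the reported loss never benefits from its own chunk's updates. The real PR adapts a transformer; the toy squared-error model and all names here are assumptions.

```python
def score_first_ttt(chunks, lr=0.005, epochs=3, momentum=0.9):
    w, v = 0.0, 0.0          # model parameter and momentum buffer
    scores = []
    for chunk in chunks:
        # 1) Score first: loss under the current (pre-update) weights.
        scores.append(sum((x - w) ** 2 for x in chunk) / len(chunk))
        # 2) Then adapt on the chunk with momentum SGD.
        for _ in range(epochs):
            grad = sum(2.0 * (w - x) for x in chunk) / len(chunk)
            v = momentum * v + grad
            w -= lr * v
    return scores, w

scores, w = score_first_ttt([[1.0] * 8, [1.0] * 8])
```

After adapting on the first chunk, the second chunk's pre-update score is already lower.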
LR Schedule
warmdown
parameters: {"fraction":0.72}
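A sketch of a warmdown schedule under the assumption that `fraction: 0.72` means the final 72% of training decays the learning rate linearly to zero after a constant phase:

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_fraction=0.72):
    """Constant LR, then a linear decay to zero over the final
    `warmdown_fraction` of training (assumed interpretation)."""
    warmdown_start = total_steps * (1.0 - warmdown_fraction)
    if step < warmdown_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)
```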
Quantization
int8
bits: 8
scope: per-row
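A sketch of symmetric per-row int8 quantization: each row gets one floating-point scale, and values are clamped to [-127, 127]. This is the common scheme implied by "per-row", not the PR's exact code.

```python
def quantize_row(row):
    """One fp scale per row; symmetric mapping into [-127, 127]."""
    amax = max(abs(x) for x in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_row(row)
```

Per-row scales keep a large value in one row from crushing the precision of every other row.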
Sequence Length
sequence_length
train_length: 524288
eval_length: null
Novel Contributions
- SP8192 tokenizer for improved compression per byte
- Depth recurrence with 11 physical layers expanded to 17 virtual layers
- GPT-J style parallel residuals in later layers
- Partial RoPE applied to 16/64 head dimensions
- EMA with delayed start during training
- Score-first chunk-based test-time training
- Muon optimizer for matrix parameters with Adam for scalars
- Artifact compressed to fit under the 16MB limit