PR #1477
RECORD · open
Record: SP8192 + Parallel Residuals + Score-First TTT - val_bpb 1.0822 (3-seed mean)
by aryanbhosale
val_bpb
1.0822
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB
Training Techniques
Architecture
depth recurrence
Uses recurrent looping over layers 4-5 to increase effective depth.
parameters: {"loop_start":4,"loop_end":5}
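As a rough illustration of what looping over layers 4-5 means for execution order, the sketch below builds a layer schedule in plain Python. The total layer count, the number of loop passes, zero-based indexing, and the inclusive range are all assumptions made for the example; the PR summary only gives loop_start and loop_end.

```python
def layer_schedule(num_layers, loop_start, loop_end, num_loops):
    """Return the order in which layer indices are applied.

    Layers loop_start..loop_end (inclusive, zero-based: an assumption)
    are visited num_loops times in a row; all other layers run once.
    """
    if not (0 <= loop_start <= loop_end < num_layers):
        raise ValueError("loop range must lie inside the layer stack")
    if num_loops < 1:
        raise ValueError("num_loops must be at least 1")
    before = list(range(loop_start))
    looped = list(range(loop_start, loop_end + 1)) * num_loops
    after = list(range(loop_end + 1, num_layers))
    return before + looped + after


def forward(x, layers, schedule):
    """Apply layers in schedule order; looped layers reuse the same weights."""
    for index in schedule:
        x = layers[index](x)
    return x


# Hypothetical sizes: 8 layers, two passes over layers 4-5.
schedule = layer_schedule(num_layers=8, loop_start=4, loop_end=5, num_loops=2)
```

Under these assumed sizes the schedule has 10 steps for 8 stored layers, which is the sense in which recurrence raises effective depth without adding parameters.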
MLP4x
Uses a 4x-expanded MLP.
parameters: null
parallel residuals
From layer 7 onward, attention and MLP operate on separate residual lanes with a learned merge scalar.
parameters: {"start_layer":7}
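The description above leaves the exact wiring open, so the following is only one plausible reading, written in plain Python on lists of floats. It assumes that both sublayers read a merged view of the two lanes, that each writes only to its own lane, and that a single scalar alpha (standing in for the learned merge scalar) blends the lanes; the function names are hypothetical.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def merge(attn_lane, mlp_lane, alpha):
    """Blend the two residual lanes with one scalar (assumed form)."""
    return [alpha * a + (1.0 - alpha) * m for a, m in zip(attn_lane, mlp_lane)]


def forward(x, blocks, start_layer, alpha):
    """Run (attn, mlp) blocks; switch to two lanes at start_layer."""
    attn_lane = mlp_lane = None
    for i, (attn, mlp) in enumerate(blocks):
        if i < start_layer:
            # Standard sequential residual block on a single stream.
            x = add(x, attn(x))
            x = add(x, mlp(x))
        else:
            if attn_lane is None:
                # First parallel layer: both lanes start from the stream.
                attn_lane, mlp_lane = list(x), list(x)
            h = merge(attn_lane, mlp_lane, alpha)
            # Each sublayer accumulates into its own lane only.
            attn_lane = add(attn_lane, attn(h))
            mlp_lane = add(mlp_lane, mlp(h))
    if attn_lane is None:
        return x
    return merge(attn_lane, mlp_lane, alpha)


# Toy sublayers that ignore their input, just to make the data flow visible.
toy_attn = lambda v: [1.0 for _ in v]
toy_mlp = lambda v: [2.0 for _ in v]
out = forward([0.0], [(toy_attn, toy_mlp)] * 9, start_layer=7, alpha=0.5)
```

In a real model alpha would be a trained parameter, possibly one per layer; a fixed 0.5 is used here only to keep the toy run easy to follow.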
Quantization
GPTQ
bits: null
scope: embeddings
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.005}
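The idea behind score-first test-time training, as the name suggests, is that each chunk of evaluation data is scored with the current weights before the model is allowed to train on it, so no token is scored by a model that has already seen it. The sketch below shows that ordering on a toy one-parameter Bernoulli model in plain Python. The model, the chunking, and the plain gradient-descent update are assumptions for illustration; only epochs=3 and learning_rate=0.005 come from the summary.

```python
import math


def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))


def chunk_bits(theta, chunk):
    """Total negative log-likelihood of a chunk of 0/1 symbols, in bits."""
    p = sigmoid(theta)
    return sum(-math.log2(p if bit else 1.0 - p) for bit in chunk)


def chunk_grad(theta, chunk):
    """Gradient of the mean NLL (in nats) with respect to theta."""
    return sigmoid(theta) - sum(chunk) / len(chunk)


def score_first_ttt(theta, chunks, epochs=3, learning_rate=0.005):
    """Score each chunk with pre-update weights, then adapt on it."""
    total_bits = 0.0
    for chunk in chunks:
        # 1) Score first: the loss is recorded before any update on this chunk.
        total_bits += chunk_bits(theta, chunk)
        # 2) Then train on the already-scored chunk for a few epochs.
        for _ in range(epochs):
            theta -= learning_rate * chunk_grad(theta, chunk)
    return total_bits, theta


bits, adapted_theta = score_first_ttt(0.0, [[1, 1, 1, 0], [1, 1, 0, 1]])
```

Swapping the two numbered steps would let the model fit each chunk before being graded on it, which is exactly what the score-first ordering is meant to rule out.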
Evaluation
sliding window eval
parameters: null
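Since no parameters are listed for the sliding window evaluation, the following is a generic sketch of the technique rather than this PR's settings. It assumes a context window that advances by a fixed stride, with each window scoring only the tokens not already scored by an earlier one, and it converts a summed loss in nats to bits per byte. The window and stride values and both function names are hypothetical.

```python
import math


def sliding_windows(num_tokens, window, stride):
    """Return (context_start, score_start, end) spans.

    Tokens in [score_start, end) are scored using context from
    context_start onward; every token is scored exactly once.
    """
    if not (0 < stride <= window):
        raise ValueError("need 0 < stride <= window")
    spans = []
    scored_upto = 0
    begin = 0
    while scored_upto < num_tokens:
        end = min(begin + window, num_tokens)
        spans.append((begin, scored_upto, end))
        scored_upto = end
        begin += stride
    return spans


def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed token NLL in nats to bits per byte of raw text."""
    return total_nll_nats / math.log(2) / num_bytes


# Hypothetical sizes chosen only to keep the spans readable.
spans = sliding_windows(num_tokens=10, window=4, stride=2)
```

A smaller stride gives later tokens more context on average at the cost of more forward passes, which is the usual trade-off when choosing these values.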
Compression
brotli
level: null
Novel Contributions
- Combines SP8192 with parallel residuals and score-first TTT in a single stack.
- Adds parallel residuals, applied from layer 7 onward, as the new component on top of the SP8192 + score-first TTT stack.
- Achieves a new record val_bpb of 1.0822 (3-seed mean), improving on prior approaches that used these techniques separately.