PR #1477

RECORD (open)

Record: SP8192 + Parallel Residuals + Score-First TTT — val_bpb 1.0822 (3-seed mean)

by aryanbhosale
val_bpb: 1.0822
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture
depth recurrence
Uses recurrent looping over layers 4-5 to increase effective depth.
parameters: {"loop_start":4,"loop_end":5}
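The looping described in this entry can be pictured as a schedule of layer indices. The sketch below is only an illustration: the total layer count (`num_layers`), the number of passes through the loop (`num_loops`), and the reading of `loop_end` as inclusive are assumptions, since the summary gives only `loop_start` and `loop_end`.

```python
def layer_schedule(num_layers, loop_start, loop_end, num_loops):
    """Return the order in which layer indices run in one forward pass.

    Layers loop_start..loop_end (inclusive, an assumption) are repeated
    num_loops times, so effective depth grows without adding parameters.
    """
    if not (0 <= loop_start <= loop_end < num_layers):
        raise ValueError("loop range must lie inside the layer stack")
    if num_loops < 1:
        raise ValueError("num_loops must be at least 1")
    before = list(range(loop_start))
    looped = list(range(loop_start, loop_end + 1)) * num_loops
    after = list(range(loop_end + 1, num_layers))
    return before + looped + after


# Hypothetical 10-layer stack with two passes through layers 4-5.
schedule = layer_schedule(num_layers=10, loop_start=4, loop_end=5, num_loops=2)
print(schedule)  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9]
```

With these placeholder values the stack stores 10 layers but applies 12, which is the sense in which recurrence increases effective depth.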
MLP4x
Uses a 4x-expanded MLP.
parameters: null
parallel residuals
From layer 7 onward, attention and MLP operate on separate residual lanes with a learned merge scalar.
parameters: {"start_layer":7}
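One way to read the two-lane description is sketched below in plain Python. The exact wiring in the PR is not given in this summary, so everything here is an assumption made for illustration: that both sublayers read the same merged input, that each writes only to its own lane, and that the merge is a convex combination controlled by a single scalar `merge`. The names `merge_lanes` and `parallel_residual_block` are hypothetical.

```python
def merge_lanes(attn_lane, mlp_lane, merge):
    """Blend the two residual lanes with one scalar (learned in a real model)."""
    return [merge * a + (1.0 - merge) * m for a, m in zip(attn_lane, mlp_lane)]


def parallel_residual_block(attn_lane, mlp_lane, attn_fn, mlp_fn, merge):
    """One block with separate residual lanes (assumed wiring, not the PR's code).

    Both sublayers see the merged state, but attention adds its update to
    the attention lane only and the MLP adds its update to the MLP lane only.
    """
    x = merge_lanes(attn_lane, mlp_lane, merge)
    new_attn_lane = [a + d for a, d in zip(attn_lane, attn_fn(x))]
    new_mlp_lane = [m + d for m, d in zip(mlp_lane, mlp_fn(x))]
    return new_attn_lane, new_mlp_lane


# Toy stand-ins for the attention and MLP sublayers.
attn_lane, mlp_lane = parallel_residual_block(
    [1.0, 2.0],
    [1.0, 2.0],
    attn_fn=lambda x: [2.0 * v for v in x],
    mlp_fn=lambda x: [v + 1.0 for v in x],
    merge=0.5,
)
hidden = merge_lanes(attn_lane, mlp_lane, 0.5)
print(attn_lane, mlp_lane, hidden)  # [3.0, 6.0] [3.0, 5.0] [3.0, 5.5]
```

In a real model the lanes would be tensors and `merge` a trainable parameter; the lists above only show how the updates are routed.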
Quantization
GPTQ
bits: null
scope: embeddings
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.005}
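The ordering implied by "score-first" can be sketched with a toy model. The example assumes the term means that each chunk of validation data is scored with the current weights before the model takes any gradient steps on it, so the reported loss never comes from weights that have already seen those tokens. The one-parameter regression model and the chunking are hypothetical stand-ins; only `epochs=3` and `learning_rate=0.005` come from the entry above.

```python
def score_first_ttt(chunks, w, epochs=3, learning_rate=0.005):
    """Score each chunk, then adapt on it (toy model: y is predicted as w * x)."""
    total_loss, count = 0.0, 0
    for chunk in chunks:
        # 1) Score first: w has not been trained on this chunk yet.
        for x, y in chunk:
            total_loss += (w * x - y) ** 2
            count += 1
        # 2) Then adapt on the chunk that was just scored.
        for _ in range(epochs):
            grad = sum(2.0 * (w * x - y) * x for x, y in chunk) / len(chunk)
            w -= learning_rate * grad
    return total_loss / count, w


# Two identical toy chunks drawn from y = 2x.
chunks = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 2.0), (2.0, 4.0)]]
baseline_loss, _ = score_first_ttt(chunks, w=0.0, learning_rate=0.0)
adapted_loss, adapted_w = score_first_ttt(chunks, w=0.0)
print(baseline_loss, adapted_loss, adapted_w)
```

The first chunk is scored identically with or without adaptation; any gain shows up only on later chunks, which is what keeps the evaluation honest.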
Evaluation
sliding window eval
parameters: null
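A common form of sliding window evaluation scores every token exactly once while giving later tokens as much left context as the window allows. The sketch below computes the window spans for that scheme; `window` and `stride` are hypothetical values, since the entry above lists no parameters.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (context_start, score_from, end) spans covering all tokens.

    Each window feeds tokens [context_start, end) to the model but only
    counts the loss on [score_from, end), so no token is scored twice.
    """
    if not (0 < stride <= window):
        raise ValueError("need 0 < stride <= window")
    spans = []
    end = 0
    while end < num_tokens:
        score_from = end
        # The first window scores a full window; later ones advance by stride.
        end = min(num_tokens, end + (window if end == 0 else stride))
        spans.append((max(0, end - window), score_from, end))
    return spans


spans = sliding_windows(num_tokens=10, window=4, stride=2)
print(spans)  # [(0, 0, 4), (2, 4, 6), (4, 6, 8), (6, 8, 10)]
```

A smaller stride gives each scored token more context at the cost of more forward passes.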
Compression
brotli
level: null

Novel Contributions

  • Combines SP8192 with parallel residuals and score-first TTT.
  • Adds parallel residuals from layer 7 to the SP8192 + score-first TTT stack.
  • Achieves a new record val_bpb of 1.0822 as a 3-seed mean, improving over prior separate approaches.
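For readers unfamiliar with the headline metric, the sketch below shows how a bits-per-byte figure and a multi-seed mean are typically computed, assuming val_bpb denotes validation bits per byte. The per-seed numbers are placeholders chosen for easy arithmetic, not results from this PR.

```python
import math
from statistics import mean


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Placeholder per-seed scores (hypothetical, for illustration only).
per_seed_bpb = [1.0, 1.5, 2.0]
seed_mean = mean(per_seed_bpb)
print(seed_mean)  # 1.5
```

Dividing by bytes rather than tokens makes scores comparable across tokenizers with different vocabulary sizes.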