PR #1326
open
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT, val_bpb 1.0896 (3-seed mean)
by aryanbhosale
val_bpb
1.0896
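For readers unfamiliar with the metric, val_bpb is validation bits per byte: the summed negative log-likelihood of the validation text, converted from nats to bits and divided by the number of raw bytes (not tokens). The sketch below only illustrates that conversion; the function name and arguments are hypothetical and are not taken from the PR.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats into bits per byte.

    total_nll_nats: sum of per-token cross-entropy losses (natural log).
    total_bytes:    number of raw bytes in the evaluated text.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by bytes normalizes
    # away the tokenizer, so different vocabularies stay comparable.
    return total_nll_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens is what lets a 4096-entry vocabulary be compared fairly against runs that use a different tokenizer.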
Architecture
Transformer
Optimizer
MuonEq-R
Artifact Size
~15.99 MB
Training Techniques
Architecture
depth recurrence
Selected layers (4 and 5) are applied recurrently, reusing their weights, during training and inference.
parameters: {"layers":[4,5]}
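One way to picture depth recurrence is as a layer execution order in which the recurrent layers appear more than once while sharing weights. The sketch below is a minimal illustration under assumptions: the card does not state the total layer count, the number of repeats, or where the repeated pass is placed, so num_layers, repeats, and the "repeat right after the first pass" placement are all hypothetical.

```python
def recurrent_schedule(num_layers, recurrent_layers, repeats=2):
    """Return layer indices in execution order.

    The contiguous block `recurrent_layers` is run `repeats` times in total,
    with the extra passes placed directly after the first one (an assumption).
    """
    block = sorted(recurrent_layers)
    if not block:
        return list(range(num_layers))
    if block != list(range(block[0], block[-1] + 1)):
        raise ValueError("recurrent_layers must be contiguous in this sketch")
    order = []
    for i in range(num_layers):
        order.append(i)
        if i == block[-1]:
            # Re-run the same layers (same weights) for the extra passes.
            for _ in range(repeats - 1):
                order.extend(block)
    return order


def forward(x, layers, order):
    """Apply `layers` to `x` following `order`; repeated indices reuse weights."""
    for i in order:
        x = layers[i](x)
    return x


# Hypothetical 8-layer stack with layers 4 and 5 recurrent:
# recurrent_schedule(8, [4, 5]) -> [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]
```

The effective depth grows by two layers here while the parameter count, and therefore the artifact size, stays unchanged.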
parallel residuals
Parallel residual pathway used from layer 7 onward.
parameters: {"start_layer":7}
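The card does not define "parallel residuals" precisely. A common reading is the parallel block formulation, where attention and MLP both read the same input and their outputs are summed into the residual stream, instead of the MLP reading the attention output. The sketch below assumes that reading, omits normalization for brevity, and uses hypothetical callables attn and mlp; it is not the PR's implementation.

```python
def block_forward(x, attn, mlp, layer_idx, start_layer=7):
    """Apply one transformer block to `x` (normalization omitted).

    Layers below `start_layer` use the usual sequential residual;
    layers at or above it use the parallel form (assumed interpretation).
    """
    if layer_idx >= start_layer:
        # Parallel: both branches see the same input.
        return x + attn(x) + mlp(x)
    # Sequential: the MLP sees the attention-updated stream.
    h = x + attn(x)
    return h + mlp(h)
```

With toy branches attn = lambda v: 2 * v and mlp = lambda v: v + 1 and input 1, the sequential path gives 7 and the parallel path gives 5, which shows that the two forms are not numerically equivalent.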
MLP4x
MLP hidden width expanded to 4x the model dimension.
parameters: null
Regularization
weight decay
parameters: {"value":0.09}
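The card gives only the weight decay value. As a rough illustration, and assuming the decay is applied in decoupled form (the weight is shrunk separately from the gradient-based update), a single scalar step could look like the sketch below. The learning rate and the function name are hypothetical.

```python
def decoupled_weight_decay_step(param, update, lr, weight_decay=0.09):
    """One optimizer step with decoupled weight decay on a scalar weight.

    `update` is whatever direction the optimizer produced for this weight;
    the decay term shrinks the weight independently of that direction.
    """
    return param * (1.0 - lr * weight_decay) - lr * update
```

With a zero update and a hypothetical lr of 0.1, a weight of 1.0 shrinks by a factor of 1 - 0.1 * 0.09 per step.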
Optimizer
MuonEq-R
weight_decay: null
momentum: null
other_params: null
Quantization
GPTQ
bits: 6
scope: all
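To make the 6-bit storage format concrete, the sketch below shows plain round-to-nearest symmetric int6 quantization of one weight row. This is deliberately not GPTQ: GPTQ additionally uses second-order (Hessian-based) information to compensate rounding error column by column, which is omitted here. The per-row scale and the symmetric range [-31, 31] are assumptions, not details from the PR.

```python
def quantize_int6(row):
    """Round-to-nearest symmetric 6-bit quantization of a list of floats.

    Returns (codes, scale) where each code is an int in [-31, 31].
    """
    qmax = 31  # symmetric signed 6-bit range; -32 is left unused
    amax = max((abs(v) for v in row), default=0.0)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

Each reconstructed weight differs from the original by at most half a quantization step (scale / 2), which is the error that GPTQ's compensation step tries to reduce further in terms of layer output.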
Compression
brotli + lzma
level: null
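The card does not say how Brotli and LZMA are combined or at what level. The sketch below shows only the LZMA half using Python's standard lzma module, as a lossless wrapper around an already-quantized byte payload; Brotli is a third-party package and is left out. The function names and the preset value are illustrative assumptions.

```python
import lzma


def pack_artifact(payload: bytes, preset: int = 9) -> bytes:
    """Losslessly compress a serialized model payload with LZMA (xz container)."""
    return lzma.compress(payload, preset=preset)


def unpack_artifact(blob: bytes) -> bytes:
    """Recover the exact original payload bytes."""
    return lzma.decompress(blob)
```

Because this stage is lossless, it changes only the artifact size and never the model's outputs; all accuracy loss comes from the quantization step before it.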
Test-Time Training
score-first TTT
parameters: {"enabled":true,"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0}
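The name "score-first" suggests that each chunk of evaluation tokens is scored with the current weights before the model is allowed to train on it, so no token contributes to the metric after the model has already seen it. The sketch below illustrates that ordering under this assumption, using hypothetical score_fn and train_fn callables and a toy one-parameter model in place of a real network.

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_tokens=32768, epochs=3):
    """Evaluate `tokens` chunk by chunk, adapting only after each chunk is scored.

    score_fn(chunk) -> summed loss with the current weights (no update).
    train_fn(chunk) -> performs one adaptation pass over the chunk.
    """
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) Score first: the weights have not been trained on this chunk yet.
        total_loss += score_fn(chunk)
        # 2) Only then adapt on the chunk, for the configured number of epochs.
        for _ in range(epochs):
            train_fn(chunk)
    return total_loss


class ToyMean:
    """Toy stand-in for a model: predicts one scalar, squared-error loss."""

    def __init__(self, lr=0.002):
        self.mu = 0.0
        self.lr = lr

    def score(self, chunk):
        return sum((t - self.mu) ** 2 for t in chunk)

    def train(self, chunk):
        grad = sum(2.0 * (self.mu - t) for t in chunk) / len(chunk)
        self.mu -= self.lr * grad
```

A usage example would be model = ToyMean(lr=0.002) followed by score_first_ttt(data, model.score, model.train). Later chunks benefit from adaptation on earlier ones, but every chunk's reported loss comes from weights that never trained on that chunk.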
Evaluation
sliding window eval
parameters: null
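Sliding window evaluation usually means scoring the validation stream with overlapping context windows, so that every token after the first window is predicted with a long left context rather than from a cold start. The card lists no parameters, so the window and stride values below are hypothetical, and the sketch only computes which token ranges are fed to the model and which are scored.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (ctx_start, score_start, end) index triples.

    The model is fed tokens [ctx_start, end) and only the losses for
    tokens [score_start, end) are counted, so each token is scored once.
    """
    if not 0 < stride <= window:
        raise ValueError("require 0 < stride <= window")
    score_start = 0
    while score_start < n_tokens:
        if score_start == 0:
            end = min(window, n_tokens)  # first window scores everything it sees
        else:
            end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        yield ctx_start, score_start, end
        score_start = end


# list(sliding_windows(10, window=4, stride=2))
# -> [(0, 0, 4), (2, 4, 6), (4, 6, 8), (6, 8, 10)]
```

A smaller stride gives each scored token more context at the cost of more forward passes.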
Sequence Length
sequence_length
train_length: 4096
eval_length: null
Novel Contributions
- SP4096 setup with a 4096-token vocabulary and widened (4x) MLP
- Depth recurrence on layers 4 and 5
- Parallel residuals starting from layer 7
- MuonEq-R optimizer
- QK-Gain 5.0
- Legal score-first test-time training
- GPTQ int6 model compression with Brotli/LZMA wrapper