PR #1326

open

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Legal TTT — val_bpb 1.0896 (3-seed mean)

by aryanbhosale

val_bpb

1.0896

Architecture

Transformer

Optimizer

MuonEq-R

Artifact Size

~15.99 MB

Training Techniques

Architecture

depth recurrence

Recurrence applied to selected layers (4 and 5) during training and inference.

parameters: {"layers":[4,5]}
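
One way to read this setting is as a layer schedule in which blocks 4 and 5 run more than once with shared weights. The sketch below builds such a schedule. The total layer count (11), the repeat count, and the helper name `recurrence_schedule` are illustrative assumptions; the PR summary only states which layers are recurrent.

```python
def recurrence_schedule(num_layers, recurrent_layers, repeats=2):
    """Return the order in which layer indices are executed.

    Assumption (not from the PR): the recurrent layers form one contiguous
    group that is run `repeats` times in a row with the same weights.
    """
    group = sorted(recurrent_layers)
    if not group:
        return list(range(num_layers))
    if group != list(range(group[0], group[-1] + 1)):
        raise ValueError("recurrent layers must be contiguous in this sketch")

    schedule = []
    for i in range(num_layers):
        if i == group[0]:
            schedule.extend(group * repeats)  # emit the whole group, repeated
        elif i in group:
            continue  # already emitted as part of the repeated group
        else:
            schedule.append(i)
    return schedule


# Hypothetical 11-layer model with layers 4 and 5 run twice.
print(recurrence_schedule(11, [4, 5]))
# -> [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```

The effective depth grows without adding parameters, which matters when the artifact size is constrained.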

parallel residuals

Parallel residual pathway used from layer 7 onward.

parameters: {"start_layer":7}
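
The difference between a sequential and a parallel residual block can be sketched with plain callables. This is a minimal illustration, assuming the usual formulation in which both branches read the same block input. Normalization layers are omitted, and `toy_attn` / `toy_mlp` are stand-ins rather than anything from the PR.

```python
def sequential_block(x, attn, mlp):
    # Standard residual block: the MLP branch sees the attention output.
    x = x + attn(x)
    return x + mlp(x)


def parallel_block(x, attn, mlp):
    # Parallel residual block: both branches read the same input x.
    return x + attn(x) + mlp(x)


def forward(x, blocks, start_layer=7):
    # Blocks with index >= start_layer use the parallel form.
    for i, (attn, mlp) in enumerate(blocks):
        block = parallel_block if i >= start_layer else sequential_block
        x = block(x, attn, mlp)
    return x


# Toy scalar "sublayers", chosen only to make the two forms distinguishable.
def toy_attn(x):
    return 0.5 * x


def toy_mlp(x):
    return 0.25 * x


print(sequential_block(1.0, toy_attn, toy_mlp))  # 1.875
print(parallel_block(1.0, toy_attn, toy_mlp))    # 1.75
```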

MLP4x

Expanded MLP width to 4x.

parameters: null

Regularization

weight decay

parameters: {"value":0.09}

Optimizer

MuonEq-R

weight_decay: null

momentum: null

other_params: null

Quantization

GPTQ

bits: 6

scope: all
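
To make the 6-bit setting concrete, the sketch below quantizes one weight row to signed 6-bit integers with a per-row scale. Note that this is plain round-to-nearest quantization, not GPTQ: GPTQ additionally compensates quantization error using second-order statistics, which is left out here. The symmetric range and per-row scaling are assumptions for illustration.

```python
def quantize_int6(row):
    """Symmetric round-to-nearest quantization of one weight row to 6 bits.

    Not GPTQ: no error compensation is performed in this sketch.
    """
    qmax = 31  # signed 6-bit covers [-32, 31]; use the symmetric part [-31, 31]
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize(q, scale):
    # Map integer codes back to approximate float weights.
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int6(weights)
restored = dequantize(codes, scale)
# Each restored weight is within half a quantization step of the original.
```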

Compression

brotli + lzma

level: null
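
The summary does not say how Brotli and LZMA are combined. As one possible reading, the sketch below compresses a payload with each available codec and keeps the smaller result behind a one-byte tag. The tag format and the `pack` / `unpack` names are hypothetical. `lzma` is in the Python standard library; `brotli` is a third-party package and is treated as optional here.

```python
import lzma

try:
    import brotli  # third-party; optional in this sketch
except ImportError:
    brotli = None


def pack(payload: bytes) -> bytes:
    """Compress with each available codec and keep the smallest output."""
    candidates = {b"L": lzma.compress(payload, preset=9)}
    if brotli is not None:
        candidates[b"B"] = brotli.compress(payload, quality=11)
    tag, blob = min(candidates.items(), key=lambda kv: len(kv[1]))
    return tag + blob  # hypothetical 1-byte tag identifies the codec


def unpack(blob: bytes) -> bytes:
    tag, body = blob[:1], blob[1:]
    if tag == b"L":
        return lzma.decompress(body)
    if tag == b"B" and brotli is not None:
        return brotli.decompress(body)
    raise ValueError("unknown or unsupported compression tag")


payload = bytes(range(64)) * 100  # stand-in for serialized quantized weights
packed = pack(payload)
assert unpack(packed) == payload
```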

Test-Time Training

score-first TTT

parameters: {"enabled":true,"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":0}
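
The ordering implied by "score-first" can be sketched on a toy model: each chunk is scored with weights that have not yet been trained on it, and only afterwards does the model adapt on that chunk. The one-parameter Bernoulli model and the small default values below are stand-ins chosen so the example runs in pure Python; the PR's own settings are the ones listed above (`learning_rate` 0.002, `epochs` 3, `chunk_tokens` 32768).

```python
import math


def chunk_nll(theta, chunk):
    """Total negative log-likelihood (nats) of binary tokens, p = sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return -sum(math.log(p if t == 1 else 1.0 - p) for t in chunk)


def grad(theta, chunk):
    """Gradient of the mean NLL with respect to theta."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return sum(p - t for t in chunk) / len(chunk)


def score_first_ttt(tokens, theta=0.0, chunk_tokens=4, lr=0.5, epochs=3):
    total = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        # 1) Score first, using weights that have not seen this chunk.
        total += chunk_nll(theta, chunk)
        # 2) Only then adapt on the chunk that was just scored.
        for _ in range(epochs):
            theta -= lr * grad(theta, chunk)
    return total / len(tokens), theta


mean_nll, theta = score_first_ttt([1] * 16)
```

Because scoring precedes the update, the reported loss for a chunk never benefits from training on that same chunk; later chunks benefit only from earlier ones.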

Evaluation

sliding window eval

parameters: null
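
Sliding window evaluation is commonly implemented by advancing a fixed-length context by a stride and scoring only the tokens not yet scored, so each token is counted once while still seeing earlier context. The sketch below generates the window boundaries. The window and stride values are hypothetical, since the summary lists no parameters for this step.

```python
def sliding_windows(num_tokens, window, stride):
    """Yield (start, end, score_from) triples.

    The model is fed tokens[start:end]; only tokens[score_from:end] count
    toward the loss, so every token is scored exactly once.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        yield start, end, scored_upto
        scored_upto = end
        start += stride


# Hypothetical sizes: 10 tokens, window of 4, stride of 2.
print(list(sliding_windows(10, 4, 2)))
# -> [(0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8)]
```

A smaller stride gives each scored token more left context at the cost of more forward passes.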

Sequence Length

sequence_length

train_length: 4096

eval_length: null

Novel Contributions

SP4096 setup: 4096-token vocabulary with MLP widened to 4x
Depth recurrence on layers 4 and 5
Parallel residuals starting from layer 7
MuonEq-R optimizer
QK-Gain 5.0
Legal score-first test-time training
GPTQ int6 model compression with Brotli/LZMA wrapper