val_bpb: 1.0813
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
- depth recurrence: a 3-layer recurrence stack as part of the SP8192 model setup (parameters: {"layers": 3})
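Depth recurrence re-applies one weight-shared block several times instead of stacking distinct layers, which is how a 3-layer recurrence can add depth without adding parameters. A minimal sketch, assuming a toy residual sub-block stands in for the real SP8192 transformer block (names and shapes here are illustrative):

```python
import numpy as np

def block(x, W):
    # Toy stand-in for one transformer sub-block: a residual nonlinearity.
    return x + np.tanh(x @ W)

def recurrent_stack(x, W, layers=3):
    # Depth recurrence: apply the SAME weight-shared block `layers` times,
    # matching parameters {"layers": 3} from the card above.
    for _ in range(layers):
        x = block(x, W)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8)) * 0.1
y = recurrent_stack(x, W, layers=3)
print(y.shape)  # (4, 8)
```

The point of the shared `W` is that unrolling more steps deepens the computation graph while the parameter count stays fixed.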
- parallel residuals: parallel residual connections in the model stack (parameters: null)
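In a parallel-residual layout, the attention and MLP branches both read the same input and their outputs are summed into one residual update, rather than running one after the other. A hedged sketch with simplified placeholders for both branches (`attn` and `mlp` below are toy stand-ins, not the actual sub-modules):

```python
import numpy as np

def attn(x, Wa):
    # Placeholder for the attention branch (toy nonlinearity, not real attention).
    return np.tanh(x @ Wa)

def mlp(x, Wm):
    # Placeholder for the MLP branch.
    return np.tanh(x @ Wm)

def parallel_block(x, Wa, Wm):
    # Parallel residuals: both branches see the same input x and their
    # outputs are added into a single residual update, instead of the
    # sequential x = x + attn(x); x = x + mlp(x) arrangement.
    return x + attn(x, Wa) + mlp(x, Wm)

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 8))
Wa = rng.standard_normal((8, 8)) * 0.1
Wm = rng.standard_normal((8, 8)) * 0.1
out = parallel_block(x, Wa, Wm)
```

Because the two branches no longer depend on each other's output, they can be computed concurrently, which is the usual motivation for this layout.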
- weight tying: tied input/output embeddings as part of the stack (parameters: null)
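Weight tying reuses the token-embedding matrix as the output projection, so the model learns no separate unembedding. A minimal sketch (vocabulary size and dimensions are arbitrary):

```python
import numpy as np

vocab, d_model = 100, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, d_model)) * 0.02  # shared embedding matrix

def embed(tokens):
    # Input side: look up rows of E.
    return E[tokens]

def unembed(h):
    # Output side: project back onto the vocabulary with E transposed,
    # so the same parameters serve both directions (weight tying).
    return h @ E.T

h = embed(np.array([1, 2, 3]))
logits = unembed(h)
print(logits.shape)  # (3, 100)
```

Tying removes the `vocab × d_model` unembedding matrix, a meaningful saving when the vocabulary is large relative to the rest of the model.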
Test-Time Training
- Legal TTT (parameters: null)
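Test-time training adapts the model with a few gradient steps on the evaluation context itself before predicting ("Legal" presumably meaning the adaptation stays within the benchmark's rules). A toy sketch using a linear model and squared-error loss in place of the real self-supervised next-token objective (the function, objective, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

def ttt_adapt(W, x_ctx, y_ctx, lr=0.1, steps=5):
    # Take a few gradient steps on the held-out context before predicting.
    # Real TTT would update the model on a self-supervised loss over the
    # context; a linear least-squares toy stands in here.
    for _ in range(steps):
        pred = x_ctx @ W
        grad = x_ctx.T @ (pred - y_ctx) / len(x_ctx)
        W = W - lr * grad
    return W

rng = np.random.default_rng(0)
x_ctx = rng.standard_normal((32, 4))
W_true = rng.standard_normal((4, 2))
y_ctx = x_ctx @ W_true

W0 = np.zeros((4, 2))
W_adapted = ttt_adapt(W0, x_ctx, y_ctx)
loss_before = np.mean((x_ctx @ W0 - y_ctx) ** 2)
loss_after = np.mean((x_ctx @ W_adapted - y_ctx) ** 2)
print(loss_after < loss_before)  # True
```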
Sequence Length
- sequence_length: train_length 8192, eval_length null
Novel Contributions
- Independent 3-seed reproduction of the SP8192 + QK-Gain 5.25 + Legal TTT stack
- Reports per-seed validation bpb along with mean and population standard deviation
- Positions the run as a near-SOTA reproducibility submission rather than a new record
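Population (rather than sample) standard deviation divides by N, not N−1, which is the natural choice when the reported seeds are the entire set of runs rather than a sample from a larger pool. A small sketch of the per-seed aggregation, with hypothetical per-seed bpb values chosen to average to the reported 1.0813 (the actual per-seed numbers are not given here):

```python
import numpy as np

# Hypothetical per-seed val_bpb values; only the mean 1.0813 is from the card.
per_seed_bpb = np.array([1.0809, 1.0813, 1.0817])

mean = per_seed_bpb.mean()
pop_std = per_seed_bpb.std(ddof=0)     # population std: divide by N
sample_std = per_seed_bpb.std(ddof=1)  # sample std: divide by N - 1

print(round(mean, 4))  # 1.0813
```

With only three seeds the two estimators differ noticeably (the ddof=1 value is larger by a factor of sqrt(3/2)), so stating which one is reported, as this submission does, matters for comparisons.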