val_bpb: 1.0813
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Architecture
- depth recurrence: a 3-layer recurrence stack as part of the SP8192 model setup (parameters: {"layers": 3})
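Depth recurrence re-applies one weight-shared block several times instead of stacking distinct layers, which is how a 3-layer recurrence can add depth without adding parameters. A minimal sketch, assuming a toy residual sub-block stands in for the real SP8192 transformer block (names and shapes here are illustrative):

```python
import numpy as np

def block(x, W):
    # Toy stand-in for one transformer sub-block: a residual nonlinearity.
    return x + np.tanh(x @ W)

def recurrent_stack(x, W, layers=3):
    # Depth recurrence: apply the SAME weight-shared block `layers` times,
    # matching parameters {"layers": 3} from the card above.
    for _ in range(layers):
        x = block(x, W)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8)) * 0.1
y = recurrent_stack(x, W, layers=3)
print(y.shape)  # (4, 8)
```

The point of the shared `W` is that unrolling more steps deepens the computation graph while the parameter count stays fixed.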
- parallel residuals: parallel residual connections in the model stack (parameters: null)
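In a parallel-residual layout, the attention and MLP branches both read the same input and their outputs are summed into one residual update, rather than running one after the other. A hedged sketch with simplified placeholders for both branches (`attn` and `mlp` below are toy stand-ins, not the actual sub-modules):

```python
import numpy as np

def attn(x, Wa):
    # Placeholder for the attention branch (toy nonlinearity, not real attention).
    return np.tanh(x @ Wa)

def mlp(x, Wm):
    # Placeholder for the MLP branch.
    return np.tanh(x @ Wm)

def parallel_block(x, Wa, Wm):
    # Parallel residuals: both branches see the same input x and their
    # outputs are added into a single residual update, instead of the
    # sequential x = x + attn(x); x = x + mlp(x) arrangement.
    return x + attn(x, Wa) + mlp(x, Wm)

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 8))
Wa = rng.standard_normal((8, 8)) * 0.1
Wm = rng.standard_normal((8, 8)) * 0.1
out = parallel_block(x, Wa, Wm)
```

Because the two branches no longer depend on each other's output, they can be computed concurrently, which is the usual motivation for this layout.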
- weight tying: tied input/output embeddings as part of the stack (parameters: null)
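Weight tying reuses the token-embedding matrix as the output projection, so the model learns no separate unembedding. A minimal sketch (vocabulary size and dimensions are arbitrary):

```python
import numpy as np

vocab, d_model = 100, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, d_model)) * 0.02  # shared embedding matrix

def embed(tokens):
    # Input side: look up rows of E.
    return E[tokens]

def unembed(h):
    # Output side: project back onto the vocabulary with E transposed,
    # so the same parameters serve both directions (weight tying).
    return h @ E.T

h = embed(np.array([1, 2, 3]))
logits = unembed(h)
print(logits.shape)  # (3, 100)
```

Tying removes the `vocab × d_model` unembedding matrix, a meaningful saving when the vocabulary is large relative to the rest of the model.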
Test-Time Training
- Legal TTT (parameters: null)
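Test-time training adapts the model with a few gradient steps on the evaluation context itself before predicting ("Legal" presumably meaning the adaptation stays within the benchmark's rules). A toy sketch using a linear model and squared-error loss in place of the real self-supervised next-token objective (the function, objective, and hyperparameters are all illustrative assumptions):

```python
import numpy as np

def ttt_adapt(W, x_ctx, y_ctx, lr=0.1, steps=5):
    # Take a few gradient steps on the held-out context before predicting.
    # Real TTT would update the model on a self-supervised loss over the
    # context; a linear least-squares toy stands in here.
    for _ in range(steps):
        pred = x_ctx @ W
        grad = x_ctx.T @ (pred - y_ctx) / len(x_ctx)
        W = W - lr * grad
    return W

rng = np.random.default_rng(0)
x_ctx = rng.standard_normal((32, 4))
W_true = rng.standard_normal((4, 2))
y_ctx = x_ctx @ W_true

W0 = np.zeros((4, 2))
W_adapted = ttt_adapt(W0, x_ctx, y_ctx)
loss_before = np.mean((x_ctx @ W0 - y_ctx) ** 2)
loss_after = np.mean((x_ctx @ W_adapted - y_ctx) ** 2)
print(loss_after < loss_before)  # True
```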
Sequence Length
- sequence_length: train_length 8192, eval_length null
Novel Contributions
- Independent 3-seed reproduction of the SP8192 + QK-Gain 5.25 + Legal TTT stack
- Reports per-seed validation bpb along with mean and population standard deviation
- Positions the run as a near-SOTA reproducibility submission rather than a new record
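Population (rather than sample) standard deviation divides by N, not N−1, which is the natural choice when the reported seeds are the entire set of runs rather than a sample from a larger pool. A small sketch of the per-seed aggregation, with hypothetical per-seed bpb values chosen to average to the reported 1.0813 (the actual per-seed numbers are not given here):

```python
import numpy as np

# Hypothetical per-seed val_bpb values; only the mean 1.0813 is from the card.
per_seed_bpb = np.array([1.0809, 1.0813, 1.0817])

mean = per_seed_bpb.mean()
pop_std = per_seed_bpb.std(ddof=0)     # population std: divide by N
sample_std = per_seed_bpb.std(ddof=1)  # sample std: divide by N - 1

print(round(mean, 4))  # 1.0813
```

With only three seeds the two estimators differ noticeably (the ddof=1 value is larger by a factor of sqrt(3/2)), so stating which one is reported, as this submission does, matters for comparisons.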