val_bpb: 1.0809
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques

Architecture
- depth recurrence: 3-layer recurrence used in the model stack
  parameters: {"layers": 3}
- parallel residuals: uses parallel residual connections in the stack
  parameters: null
- GQA: uses QK gain 5.25 in the attention stack
  parameters: {"qk_gain": 5.25}
Test-Time Training
- score-first TTT
  parameters: {"learning_rate": 0.005, "epochs": 3}
Sequence Length
- train_length: 8192
- eval_length: null
Novel Contributions
- W104 SP8192 LegalTTT record candidate
- 3-seed replay with mean val_bpb 1.08089556
- SP8192 + 3-layer recurrence + parallel residuals + QK-Gain 5.25 + legal score-first TTT stack
- Faithful source-visible replay configuration
- No V7, V8, or V9 auxiliary data