val_bpb: 1.0809
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques

Architecture
- depth recurrence: 3-layer recurrence used in the model stack
  parameters: {"layers": 3}
- parallel residuals: uses parallel residual connections in the stack
  parameters: null
- GQA: uses QK gain 5.25 in the attention stack
  parameters: {"qk_gain": 5.25}
Test-Time Training
- score-first TTT
  parameters: {"learning_rate": 0.005, "epochs": 3}
Sequence Length
- train_length: 8192
- eval_length: null
Novel Contributions
- W104 SP8192 LegalTTT record candidate
- 3-seed replay with mean val_bpb 1.08089556
- SP8192 + 3-layer recurrence + parallel residuals + QK-Gain 5.25 + legal score-first TTT stack
- Faithful source-visible replay configuration
- No V7, V8, or V9 auxiliary data