val_bpb: 1.0845
Architecture: Transformer
Optimizer: —
Artifact Size: 15,985,765 bytes
Training Techniques
- Quantization: GPTQ (bits: 4, scope: model weights)
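The 4-bit GPTQ entry above can be illustrated with a round-to-nearest sketch of quantizing weights onto a 4-bit grid. This is an assumption-laden illustration, not the record's export code: GPTQ itself additionally compensates rounding error with second-order (Hessian-based) information, which is omitted here.

```python
def quantize_4bit(weights, bits=4):
    """Uniform round-to-nearest quantization of a weight vector onto a
    2**bits-level grid. Illustration only: real GPTQ also applies
    second-order error compensation when rounding each column."""
    levels = 2 ** bits - 1                          # 15 steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]   # int codes 0..15
    dequant = [lo + c * scale for c in codes]            # reconstructed values
    return codes, dequant, scale
```

At 4 bits the grid has only 16 representable values per group, which is why the exported artifact fits in under 16 MB.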
- Weight Averaging: EMA (decay: 0.997)
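The EMA entry maps to the standard exponential-moving-average update over weights; a minimal sketch with the card's decay of 0.997 (parameter names are illustrative):

```python
def ema_update(ema_params, model_params, decay=0.997):
    """One EMA step over a dict of weights: ema <- decay*ema + (1-decay)*model.
    With decay 0.997, each step folds in 0.3% of the current weights."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```

The averaged copy, not the raw training weights, is typically what gets evaluated and exported.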
- Evaluation: sliding window eval (enabled: true)
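Sliding-window evaluation can be sketched as follows: score each block of `stride` new tokens conditioned on up to `window` tokens of preceding context. The defaults below echo the card's eval_length (524288) and eval_stride (64) entries, but the record's actual evaluation code may differ; `score_fn` is a hypothetical interface returning summed NLL.

```python
def sliding_window_nll(score_fn, tokens, window=524288, stride=64):
    """Average per-token NLL of a long sequence under a sliding window.
    Each step scores only the `stride` newest tokens, so every scored token
    sees (up to) `window` tokens of context rather than a cold start."""
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        # score_fn(context, targets) -> summed NLL of targets given context
        total_nll += score_fn(tokens[ctx_start:start], tokens[start:end])
        n_scored += end - start
    return total_nll / n_scored
```

A small stride keeps the context nearly maximal for every token at the cost of more forward passes.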
- Test-Time Training: score-first TTT (learning_rate: 0.005, epochs: 3, chunk_tokens: 32768, batch_seqs: 32, freeze_blocks: 0, eval_stride: 64)
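"Score-first" ordering means each chunk is scored with the current weights before the model is allowed to adapt on it, so no chunk contributes to its own score. A minimal sketch of that control flow; the `model` interface here is a stand-in, not the record's actual training stack:

```python
def score_first_ttt(model, chunks, lr=0.005, epochs=3):
    """Score-first test-time training: evaluate each chunk BEFORE taking
    gradient steps on it (the 'legal' ordering), then adapt and move on."""
    losses = []
    for chunk in chunks:
        losses.append(model.score(chunk))   # 1) score with current weights
        for _ in range(epochs):             # 2) only then train on the chunk
            model.train_step(chunk, lr=lr)
    return sum(losses) / len(losses)
```

The reported val_bpb is therefore an average of scores taken strictly before each adaptation step.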
- Sequence Length (train_length: 786432, eval_length: 524288)
- Regularization: weight decay (no parameters reported)
- Other: QK gain initialization and matrix clipping used in the SP8192 training stack (qk_gain_init: 4, matrix_clip_sigmas: 12.86)
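One plausible reading of matrix_clip_sigmas is clipping weight-matrix entries to within a fixed number of standard deviations of the matrix mean; the sketch below implements that reading with the card's 12.86 value as the default. This is an assumption: the SP8192 stack's exact clipping rule (what statistic it clips against, and when) is not specified here.

```python
import math

def clip_at_sigmas(matrix, sigmas=12.86):
    """Clip every entry of a 2-D matrix (list of rows) to within `sigmas`
    standard deviations of the matrix mean. Illustrative only; the actual
    SP8192 clipping rule may use a different statistic or schedule."""
    flat = [x for row in matrix for x in row]
    mean = sum(flat) / len(flat)
    var = sum((x - mean) ** 2 for x in flat) / len(flat)
    bound = sigmas * math.sqrt(var)
    lo, hi = mean - bound, mean + bound
    return [[min(max(x, lo), hi) for x in row] for row in matrix]
```

At 12.86 sigmas the clip is a safety rail against extreme outlier entries rather than a routine regularizer.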
Novel Contributions
- Non-record reproduction of an 8xH100 SP8192 QK4 legal score-first TTT run
- Legal TTT evaluation adapted from the April 6 QK5 record with score-first ordering
- End-to-end reproduction of training, GPTQ/SDClip export, sliding-window validation, and legal TTT under the artifact cap
- Provided reproducible scripts, metadata, and logs for the SP8192 record-family milestone