val_bpb: 1.0845
Architecture: Transformer
Optimizer: —
Artifact Size: 15,985,765 bytes
Training Techniques
- Quantization: GPTQ (bits: 4, scope: model weights)
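The 4-bit GPTQ entry above can be illustrated with a round-to-nearest sketch of quantizing weights onto a 4-bit grid. This is an assumption-laden illustration, not the record's export code: GPTQ itself additionally compensates rounding error with second-order (Hessian-based) information, which is omitted here.

```python
def quantize_4bit(weights, bits=4):
    """Uniform round-to-nearest quantization of a weight vector onto a
    2**bits-level grid. Illustration only: real GPTQ also applies
    second-order error compensation when rounding each column."""
    levels = 2 ** bits - 1                          # 15 steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]   # int codes 0..15
    dequant = [lo + c * scale for c in codes]            # reconstructed values
    return codes, dequant, scale
```

At 4 bits the grid has only 16 representable values per group, which is why the exported artifact fits in under 16 MB.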
- Weight Averaging: EMA (decay: 0.997)
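The EMA entry maps to the standard exponential-moving-average update over weights; a minimal sketch with the card's decay of 0.997 (parameter names are illustrative):

```python
def ema_update(ema_params, model_params, decay=0.997):
    """One EMA step over a dict of weights: ema <- decay*ema + (1-decay)*model.
    With decay 0.997, each step folds in 0.3% of the current weights."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```

The averaged copy, not the raw training weights, is typically what gets evaluated and exported.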
- Evaluation: sliding window eval (enabled: true)
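Sliding-window evaluation can be sketched as follows: score each block of `stride` new tokens conditioned on up to `window` tokens of preceding context. The defaults below echo the card's eval_length (524288) and eval_stride (64) entries, but the record's actual evaluation code may differ; `score_fn` is a hypothetical interface returning summed NLL.

```python
def sliding_window_nll(score_fn, tokens, window=524288, stride=64):
    """Average per-token NLL of a long sequence under a sliding window.
    Each step scores only the `stride` newest tokens, so every scored token
    sees (up to) `window` tokens of context rather than a cold start."""
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        # score_fn(context, targets) -> summed NLL of targets given context
        total_nll += score_fn(tokens[ctx_start:start], tokens[start:end])
        n_scored += end - start
    return total_nll / n_scored
```

A small stride keeps the context nearly maximal for every token at the cost of more forward passes.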
- Test-Time Training: score-first TTT (learning_rate: 0.005, epochs: 3, chunk_tokens: 32768, batch_seqs: 32, freeze_blocks: 0, eval_stride: 64)
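"Score-first" ordering means each chunk is scored with the current weights before the model is allowed to adapt on it, so no chunk contributes to its own score. A minimal sketch of that control flow; the `model` interface here is a stand-in, not the record's actual training stack:

```python
def score_first_ttt(model, chunks, lr=0.005, epochs=3):
    """Score-first test-time training: evaluate each chunk BEFORE taking
    gradient steps on it (the 'legal' ordering), then adapt and move on."""
    losses = []
    for chunk in chunks:
        losses.append(model.score(chunk))   # 1) score with current weights
        for _ in range(epochs):             # 2) only then train on the chunk
            model.train_step(chunk, lr=lr)
    return sum(losses) / len(losses)
```

The reported val_bpb is therefore an average of scores taken strictly before each adaptation step.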
- Sequence Length (train_length: 786432, eval_length: 524288)
- Regularization: weight decay (no parameters reported)
- Other: QK gain initialization and matrix clipping used in the SP8192 training stack (qk_gain_init: 4, matrix_clip_sigmas: 12.86)
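One plausible reading of matrix_clip_sigmas is clipping weight-matrix entries to within a fixed number of standard deviations of the matrix mean; the sketch below implements that reading with the card's 12.86 value as the default. This is an assumption: the SP8192 stack's exact clipping rule (what statistic it clips against, and when) is not specified here.

```python
import math

def clip_at_sigmas(matrix, sigmas=12.86):
    """Clip every entry of a 2-D matrix (list of rows) to within `sigmas`
    standard deviations of the matrix mean. Illustrative only; the actual
    SP8192 clipping rule may use a different statistic or schedule."""
    flat = [x for row in matrix for x in row]
    mean = sum(flat) / len(flat)
    var = sum((x - mean) ** 2 for x in flat) / len(flat)
    bound = sigmas * math.sqrt(var)
    lo, hi = mean - bound, mean + bound
    return [[min(max(x, lo), hi) for x in row] for row in matrix]
```

At 12.86 sigmas the clip is a safety rail against extreme outlier entries rather than a routine regularizer.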
Novel Contributions
- Non-record reproduction of an 8xH100 SP8192 QK4 legal score-first TTT run
- Legal TTT evaluation adapted from the April 6 QK5 record with score-first ordering
- End-to-end reproduction of training, GPTQ/SDClip export, sliding-window validation, and legal TTT under the artifact cap
- Provided reproducible scripts, metadata, and logs for the SP8192 record-family milestone