val_bpb: 1.0742
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,985,150 bytes
Training Techniques

Architecture
- depth recurrence: uses a recurrent depth structure in the model (parameters: null)
- parallel residuals: uses parallel residual connections (parameters: null)
Optimizer
- Muon (weight_decay: null, momentum: null, other_params: null)
Quantization
- GPTQ (bits: 6, scope: matrices)
- GPTQ (bits: 7, scope: embeddings)
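For readers unfamiliar with the bit widths above: GPTQ itself uses Hessian-based error compensation when rounding, which is not reproduced here. The following is only a minimal round-to-nearest symmetric quantization sketch, illustrating what a "bits: 6" setting means for a row of weights; the function names are illustrative, not from the artifact.

```python
# Minimal round-to-nearest symmetric quantization sketch.
# NOTE: this is NOT GPTQ (which adds Hessian-based error compensation);
# it only illustrates what "bits: 6" means for a weight row.

def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 31 for 6-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from integers and scale."""
    return [v * scale for v in q]

row = [0.12, -0.87, 0.44, 0.003]
q, scale = quantize_symmetric(row, bits=6)
approx = dequantize(q, scale)               # each entry within scale/2 of row
```

At 6 bits the largest-magnitude weight maps to ±31, and every reconstructed value lies within half a quantization step of the original.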
Compression
- Brotli (level: null)
Evaluation
- sliding window eval (parameters: null)
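The card does not specify the sliding-window parameters, so the following is a hedged sketch of one common scheme: score a sequence longer than the context length by sliding a fixed window and counting loss only on the newly added (stride) tokens. `nll_per_token` is a hypothetical model hook, stubbed out here with a uniform byte model.

```python
# Hedged sketch of one common "sliding window" evaluation scheme.
# `nll_per_token` stands in for the real model and is stubbed out.

import math

def nll_per_token(window):
    # Stub: per-token negative log-likelihood (nats), uniform over 256 bytes.
    return [math.log(256.0)] * len(window)

def sliding_window_bpb(seq, context=8, stride=4):
    """Bits-per-byte over `seq`, scoring only the last `stride` tokens
    of each window so every byte is counted exactly once."""
    total_nll, counted = 0.0, 0
    for start in range(0, len(seq), stride):
        window = seq[max(0, start + stride - context):start + stride]
        new = min(stride, len(seq) - start)      # tokens not yet scored
        total_nll += sum(nll_per_token(window)[-new:])
        counted += new
    return total_nll / counted / math.log(2)     # nats -> bits
```

With the uniform-over-256-bytes stub, any input scores exactly 8.0 bits per byte, which is a convenient sanity check for the windowing bookkeeping.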
Test-Time Training
- TTT (parameters: {"enabled": false})
Novel Contributions
- Controlled single-seed ablation changing the QK gain from 5.0 to 5.125
- Lowercase SP10240 tokenizer setup
- Retains depth recurrence, parallel residuals, Muon training, GPTQ INT6 matrices, INT7 embeddings, and Brotli compression
- Reports a negative ablation result: slightly worse validation BPB than the 5.0 baseline
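The card does not define "QK gain" precisely; one plausible reading is a scalar multiplier applied to the query-key dot products before softmax, so the 5.0 → 5.125 ablation changes only that constant. The sketch below, in pure Python with illustrative names, shows where such a gain would enter single-query attention.

```python
# Hedged sketch: "QK gain" read as a scalar multiplier on the
# query-key logits before softmax (an assumption, not the card's code).

import math

def attention_weights(q, k_rows, qk_gain=5.0):
    """Softmax over gain-scaled dot products of one query against keys."""
    d = len(q)
    logits = [qk_gain * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_rows]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Under this reading, raising the gain from 5.0 to 5.125 slightly sharpens the attention distribution; the reported result is that this small change did not help validation BPB.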