val_bpb: 1.0742
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,985,150 bytes
Training Techniques

Architecture
- depth recurrence: uses a recurrent depth structure in the model (parameters: null)
- parallel residuals: uses parallel residual connections (parameters: null)
Optimizer
- Muon (weight_decay: null, momentum: null, other_params: null)
Quantization
- GPTQ (bits: 6, scope: matrices)
- GPTQ (bits: 7, scope: embeddings)
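For readers unfamiliar with the bit widths above: GPTQ itself uses Hessian-based error compensation when rounding, which is not reproduced here. The following is only a minimal round-to-nearest symmetric quantization sketch, illustrating what a "bits: 6" setting means for a row of weights; the function names are illustrative, not from the artifact.

```python
# Minimal round-to-nearest symmetric quantization sketch.
# NOTE: this is NOT GPTQ (which adds Hessian-based error compensation);
# it only illustrates what "bits: 6" means for a weight row.

def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 31 for 6-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from integers and scale."""
    return [v * scale for v in q]

row = [0.12, -0.87, 0.44, 0.003]
q, scale = quantize_symmetric(row, bits=6)
approx = dequantize(q, scale)               # each entry within scale/2 of row
```

At 6 bits the largest-magnitude weight maps to ±31, and every reconstructed value lies within half a quantization step of the original.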
Compression
- Brotli (level: null)
Evaluation
- sliding window eval (parameters: null)
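The card does not specify the sliding-window parameters, so the following is a hedged sketch of one common scheme: score a sequence longer than the context length by sliding a fixed window and counting loss only on the newly added (stride) tokens. `nll_per_token` is a hypothetical model hook, stubbed out here with a uniform byte model.

```python
# Hedged sketch of one common "sliding window" evaluation scheme.
# `nll_per_token` stands in for the real model and is stubbed out.

import math

def nll_per_token(window):
    # Stub: per-token negative log-likelihood (nats), uniform over 256 bytes.
    return [math.log(256.0)] * len(window)

def sliding_window_bpb(seq, context=8, stride=4):
    """Bits-per-byte over `seq`, scoring only the last `stride` tokens
    of each window so every byte is counted exactly once."""
    total_nll, counted = 0.0, 0
    for start in range(0, len(seq), stride):
        window = seq[max(0, start + stride - context):start + stride]
        new = min(stride, len(seq) - start)      # tokens not yet scored
        total_nll += sum(nll_per_token(window)[-new:])
        counted += new
    return total_nll / counted / math.log(2)     # nats -> bits
```

With the uniform-over-256-bytes stub, any input scores exactly 8.0 bits per byte, which is a convenient sanity check for the windowing bookkeeping.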
Test-Time Training
- TTT (parameters: {"enabled": false})
Novel Contributions
- Controlled single-seed ablation changing the QK gain from 5.0 to 5.125
- Lowercase SP10240 tokenizer setup
- Retains depth recurrence, parallel residuals, Muon training, GPTQ INT6 matrices, INT7 embeddings, and Brotli compression
- Reports a negative ablation result: slightly worse validation BPB than the 5.0 baseline
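The card does not define "QK gain" precisely; one plausible reading is a scalar multiplier applied to the query-key dot products before softmax, so the 5.0 → 5.125 ablation changes only that constant. The sketch below, in pure Python with illustrative names, shows where such a gain would enter single-query attention.

```python
# Hedged sketch: "QK gain" read as a scalar multiplier on the
# query-key logits before softmax (an assumption, not the card's code).

import math

def attention_weights(q, k_rows, qk_gain=5.0):
    """Softmax over gain-scaled dot products of one query against keys."""
    d = len(q)
    logits = [qk_gain * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_rows]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Under this reading, raising the gain from 5.0 to 5.125 slightly sharpens the attention distribution; the reported result is that this small change did not help validation BPB.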