val_bpb
1.0585
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15,457,982 to 15,504,058 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
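A minimal sketch of the 6-bit grid the GPTQ step quantizes onto, assuming symmetric per-channel round-to-nearest. GPTQ itself additionally corrects quantization error column-by-column using second-order information, which is omitted here; `quantize_6bit` and `dequantize` are hypothetical names.

```python
import numpy as np

def quantize_6bit(w, axis=0):
    """Symmetric round-to-nearest quantization to a signed 6-bit grid,
    one scale per output channel. GPTQ quantizes onto this same grid
    but also compensates for the induced error (not shown)."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for signed 6-bit
    scale = np.abs(w).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)     # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
q, s = quantize_6bit(w)
w_hat = dequantize(q, s)
```

Each weight lands within half a scale step of its original value, which is what bounds the bpb degradation from quantization.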
Architecture
weight tying
Input embedding and output projection share one weight matrix (tied embeddings), as implied by the canonical method mapping in the submission README.
parameters: null
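The weight-tying entry above amounts to sharing one matrix between the embedding lookup and the output projection, which removes a full vocab-by-width matrix from the artifact. A sketch with illustrative sizes (names and dimensions are not from the submission):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 100, 32
embedding = rng.normal(size=(vocab, d_model)).astype(np.float32)

def embed(token_ids):
    # Input side: row lookup into the shared matrix.
    return embedding[token_ids]

def logits(hidden):
    # Output side: project against the SAME matrix (tied weights),
    # so no separate unembedding matrix is stored.
    return hidden @ embedding.T
```

Any update to `embedding` affects both the input lookup and the output head, which is the point of tying.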
depth recurrence
Looping architecture with repeated passes over selected layers.
parameters: {"layers":[4,5],"loops":2}
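Given the parameters above, a plausible reading is that the block of layers 4-5 is traversed twice per forward pass. Whether the repeat runs the block as a unit (4, 5, 4, 5) or repeats each layer individually is an assumption; this sketch uses the block reading:

```python
def forward(x, layers, block=(4, 5), loops=2):
    """Apply a layer stack, making `loops` passes over the contiguous
    block of layers named in `block` (layers 4-5 per the submission).
    `layers` is a list of callables standing in for transformer blocks."""
    start, end = block[0], block[-1]
    for layer in layers[:start]:          # layers before the loop block
        x = layer(x)
    for _ in range(loops):                # repeated passes over the block
        for layer in layers[start:end + 1]:
            x = layer(x)
    for layer in layers[end + 1:]:        # layers after the loop block
        x = layer(x)
    return x
```

The appeal for a size-capped artifact is extra effective depth with no extra stored parameters.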
Gated Attention
Attention modified with a QK gain that sharpens the attention distribution.
parameters: {"qk_gain_init":5.25}
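One way to realize the QK gain above is as a multiplier on the pre-softmax scores, initialized to the submission's 5.25; the exact placement and parameterization of the gain are assumptions, and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_attention(q, k, v, qk_gain=5.25):
    """Scaled dot-product attention with an extra QK gain on the
    scores. A gain > 1 lowers the effective softmax temperature,
    producing sharper (more peaked) attention weights."""
    d = q.shape[-1]
    weights = softmax(qk_gain * (q @ k.T) / np.sqrt(d))
    return weights @ v, weights
```

Raising the gain concentrates each query's weight on fewer keys, which is the "sharper attention behavior" the entry describes.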
Weight Averaging
EMA
parameters: {"decay":0.9965}
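EMA weight averaging keeps a shadow copy of the parameters that decays toward each new training iterate; the shadow copy is what gets shipped. A sketch with the submission's decay of 0.9965, using NumPy arrays as stand-ins for parameters:

```python
import numpy as np

class EMA:
    """Exponential moving average of model parameters.

    shadow <- decay * shadow + (1 - decay) * current
    """
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = [p.copy() for p in params]

    def update(self, params):
        d = self.decay
        for s, p in zip(self.shadow, params):
            s *= d                 # in-place decay of the running average
            s += (1 - d) * p       # blend in the latest parameters
```

With decay 0.9965 the average has an effective horizon of roughly 1 / (1 - 0.9965) ≈ 286 steps.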
Evaluation
sliding window eval
parameters: null
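Sliding-window evaluation typically scores only the trailing stride of each window, so every scored token is predicted with long left context instead of restarting cold at chunk boundaries. Since eval_length is null above, the window and stride values in this sketch are placeholders, and `nll_fn` is a hypothetical stand-in for the model:

```python
def sliding_window_eval(tokens, nll_fn, window=2048, stride=512):
    """Average per-token NLL under a sliding window.

    Each window of up to `window` tokens is scored, but only the final
    `stride` positions contribute to the average, so every counted
    token sees the longest available left context.
    `nll_fn(context) -> list of per-token NLLs` stands in for the model.
    """
    total, count = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + stride, len(tokens))
        ctx_start = max(0, end - window)
        nlls = nll_fn(tokens[ctx_start:end])
        take = end - begin              # only the new (non-overlap) tokens
        total += sum(nlls[-take:])
        count += take
    return total / count
```

The cost is roughly window/stride forward passes per token's worth of text, traded for a lower (fairer) bpb.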
Test-Time Training
full TTT
parameters: {"learning_rate":0.00045,"epochs":10,"freeze_blocks":1}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Compression
brotli
level: null
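The artifact is Brotli-compressed at an unspecified level and must fit the size budget. This sketch uses stdlib `zlib` as a stand-in so it stays dependency-free (the real pipeline would call `brotli.compress` instead), and whether the 16MB limit is decimal or binary is an assumption:

```python
import zlib

def pack_artifact(payload: bytes, limit: int = 16_000_000) -> bytes:
    """Compress the serialized model and enforce the size budget.

    The submission uses brotli; zlib stands in here. The 16,000,000-byte
    limit is an assumed decimal reading of "16MB".
    """
    blob = zlib.compress(payload, level=9)
    if len(blob) > limit:
        raise ValueError(f"artifact is {len(blob)} bytes, over the {limit}-byte limit")
    return blob
```

The reported artifact sizes (15,457,982 to 15,504,058 bytes) sit just under a decimal 16MB budget, which is consistent with this reading.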
Novel Contributions
- SAFE_SUBMISSION artifact staged from an authoritative TensorPool pull rather than from live-log heuristics
- Pre-quantization TTT baked into the artifact as a fixed predictor
- SP1024 tokenizer combined with a looping architecture over layers 4-5
- TTT hyperparameter tuning: 10 epochs, a lower learning rate, and fewer frozen blocks
- GPTQ int6 quantization with Brotli compression, keeping the artifact under the 16MB limit
- Explicit legality separation between the submission score and frontier-only SLOT numbers