val_bpb: 1.0849
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,962,961 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: MLP, attention, embeddings
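GPTQ solves a per-layer least-squares rounding problem and needs calibration data, so it is too long to inline; as a minimal stand-in, the sketch below shows outlier-aware 6-bit round-to-nearest quantization with a per-category clip sigma, in the spirit of the recipe's "per-layer GPTQ clip sigmas". Function names and the sigma values are illustrative, not taken from the card.

```python
import torch

def quantize_6bit(w: torch.Tensor, clip_sigma: float = 4.0) -> torch.Tensor:
    """Symmetric per-output-channel 6-bit fake-quantization with sigma clipping.

    Round-to-nearest stand-in for GPTQ: clips outliers at clip_sigma standard
    deviations per channel, then maps onto the signed 6-bit grid [-31, 31].
    """
    qmax = 2 ** (6 - 1) - 1                                   # 31 levels per side
    std = w.std(dim=1, keepdim=True)
    clipped = w.clamp(-clip_sigma * std, clip_sigma * std)
    scale = clipped.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(clipped / scale).clamp(-qmax, qmax)
    return q * scale                                          # dequantized weights

# Hypothetical per-category clip sigmas (values illustrative, not from the recipe)
CLIP_SIGMA = {"mlp": 4.5, "attention": 4.0, "embeddings": 3.5}
```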
Architecture
weight tying
Tied input and output embeddings.
parameters: null
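A minimal sketch of the standard tying pattern (class name illustrative): the output projection reuses the embedding matrix, so the artifact stores one vocab-sized tensor instead of two.

```python
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Input embedding and output projection share a single weight matrix."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # tying: one parameter tensor
```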
Partial RoPE
Uses rotary position embeddings on a subset of dimensions.
parameters: {"dimensions":"16/64"}
depth recurrence
Reuses layers in a recurrent encoder/decoder pattern.
parameters: {"layers":"3-5"}
ReLU²
Uses squared ReLU activation in the MLP.
parameters: null
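A minimal squared-ReLU feed-forward block (hidden width and bias choices illustrative):

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """Feed-forward block with squared-ReLU activation: relu(x) ** 2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)
    def forward(self, x):
        return self.down(torch.relu(self.up(x)) ** 2)
```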
XSA
Applies XSA in the last layers.
parameters: {"layers":11}
Gated Attention
Adds per-head gating of the attention output; the gate parameter is zero-initialized, so the gate value starts at 1 and the module begins as the identity.
parameters: {"gate":"2σ(attn_gate)"}
Regularization
logit softcap
parameters: {"value":30}
layerwise LN scale
parameters: null
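The card gives no parameters for this entry; one common reading is a learnable per-layer scalar applied to each block's normalization output, sketched below as an assumption rather than the recipe's actual mechanism.

```python
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm followed by a per-layer learnable scalar (one possible reading)."""
    def __init__(self, d_model: int, init_scale: float = 1.0):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.scale = nn.Parameter(torch.tensor(init_scale))
    def forward(self, x):
        return self.scale * self.norm(x)
```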
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start_momentum":0.92}
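The momentum values imply a warmup from 0.92 at the start of training to the final 0.99; the linear shape and warmup length below are assumptions, since the card only gives the two endpoints.

```python
def muon_momentum(step: int, warmup_steps: int,
                  start: float = 0.92, end: float = 0.99) -> float:
    """Warms Muon's momentum from its start value (0.92) to its final value (0.99)."""
    t = min(step / max(warmup_steps, 1), 1.0)
    return start + t * (end - start)
```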
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings + scalars"}
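The AdamW scope ("embeddings + scalars") implies a two-way parameter split, with Muon taking the hidden weight matrices. The split rule below is an assumption consistent with that scope, not the recipe's exact code.

```python
def split_param_groups(model):
    """Muon gets 2-D hidden weight matrices; AdamW gets embeddings and scalars."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            muon_params.append(p)        # hidden weight matrices -> Muon
        else:
            adamw_params.append(p)       # embeddings + scalar params -> AdamW
    return muon_params, adamw_params
```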
Test-Time Training
score-first TTT
parameters: {"freeze_blocks":0,"learning_rate":0.01,"epochs":5,"momentum":0.9}
LR Schedule
cosine decay
parameters: null
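The standard cosine-decay schedule, for reference (base and floor learning rates are not given by the card):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr over total_steps."""
    t = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```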
Sequence Length
sequence_length
train_length: null
eval_length: null
Other
other
Eval wall-clock budget guard: truncates test-time adaptation when the projected runtime would exceed the 600 s limit, while scoring continues to completion.
parameters: {"max_eval_seconds":600,"warmup_chunks":5}
Novel Contributions
- Per-layer GPTQ clip sigmas tuned separately for MLP, attention, and embeddings
- Unfrozen score-first TTT with all non-embedding blocks adapting
- Evaluation wall-clock budget guard that truncates adaptation while preserving scoring legality
- Recipe built on SP8192 base architecture with per-category outlier-aware quantization