| Field | Value |
|---|---|
| val_bpb | 1.1145 |
| Architecture | Transformer |
| Optimizer | Parallel Muon |
| Artifact Size | 15.38 MB |
## Training Techniques

### Quantization
- **GPTQ** (`bits: 5`, `scope: all`)
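As a rough illustration of the 5-bit grid the GPTQ step targets, here is a minimal round-to-nearest INT5 quantizer with the symmetric `clip_range=15` mentioned under Novel Contributions. This is only a sketch: real GPTQ additionally does Hessian-aware, column-by-column error correction, which is omitted here.

```python
import numpy as np

def quantize_int5(w, clip_range=15):
    # 5-bit signed integers cover [-16, 15]; clip_range=15 keeps the grid
    # symmetric at the cost of one code point.
    scale = np.abs(w).max() / clip_range
    q = np.clip(np.round(w / scale), -clip_range, clip_range).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int5(w)
w_hat = dequantize(q, s)
```

With per-tensor scaling the round-trip error is bounded by half a quantization step, i.e. `s / 2`.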
### Architecture
- **XSA**: cross-sequence attention applied to all layers (`layers: 11`)
- **SmearGate**: included as part of the architecture
- **U-Net skip connections**: skip connections in a U-Net style
- **LeakyReLU**: LeakyReLU-squared MLP activation
- **Partial RoPE**: partial rotary positional embeddings (`fraction: 16/64`)
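A minimal sketch of partial RoPE as configured above: rotary embeddings are applied to only the first 16 of 64 head dimensions, and the remaining 48 pass through unchanged. Function and variable names are illustrative, not the submission's actual code.

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    # x: (seq_len, head_dim). Rotate only the first `rotary_dims` dims
    # (fraction 16/64 of a 64-dim head); pass the rest through untouched.
    seq, _ = x.shape
    rot, rest = x[:, :rotary_dims], x[:, rotary_dims:]
    half = rotary_dims // 2
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 64))
y = partial_rope(x)
```

Because rotation is norm-preserving, the rotated slice keeps its per-position norm, and position 0 (zero angle) is left exactly as-is.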
### Test-Time Training
- **score-first TTT** (`epochs: 5`, `learning_rate: 0.0001`, `chunks: 262144`)
### Evaluation
- **sliding window eval** (`stride: 64`)
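A sketch of the standard sliding-window evaluation scheme assumed here: windows advance by `stride` tokens, each window scores only the tokens not already scored by an earlier window, and earlier tokens in the window serve as context. The window size is illustrative; only `stride: 64` comes from the config.

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    # Returns (context_begin, end, score_from) triples: the model sees
    # tokens [context_begin, end) and is scored on [score_from, end),
    # so every token is scored exactly once.
    spans = []
    pos = 0
    while pos < n_tokens:
        begin = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((begin, end, pos))
        pos = end
    return spans

spans = sliding_window_spans(200, window=128, stride=64)
print(spans)  # [(0, 64, 0), (0, 128, 64), (64, 192, 128), (128, 200, 192)]
```

A smaller stride gives each scored token more context at the cost of more forward passes.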
### Optimizer
- **Parallel Muon** (`weight_decay: 0.04`, `momentum: null`, `ns: 5`, `lr: 0.025`)
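The `ns: 5` parameter is the number of Newton-Schulz iterations at the heart of Muon, which replace the update matrix with an approximately orthogonal one. Below is a sketch of that core step using the quintic iteration popularized by the Muon optimizer; the "Parallel" part (overlapping the orthogonalization with inter-worker communication) is not shown.

```python
import numpy as np

def newton_schulz_orth(g, ns=5, a=3.4445, b=-4.7750, c=2.0315):
    # Normalize so singular values are <= 1, then run `ns` quintic
    # Newton-Schulz steps to push all singular values toward 1.
    x = g / (np.linalg.norm(g) + 1e-7)
    for _ in range(ns):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x

rng = np.random.default_rng(0)
g = rng.standard_normal((16, 16))
o = newton_schulz_orth(g)
```

After 5 iterations the singular values of `o` sit near 1 (the iteration trades exact orthogonality for speed), and `o` stays aligned with the input's direction.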
### Weight Averaging
- **EMA + SWA** (`ema_decay: 0.997`, `swa_interval_steps: 50`)
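One plausible way to combine the two averages with the configured parameters (this composition is an assumption; the config only states the two techniques and their hyperparameters): keep an EMA of the weights every step, and every `swa_interval_steps` fold the current EMA into a running SWA average.

```python
class AveragedWeights:
    # Sketch: EMA updated every step; SWA averages EMA snapshots
    # taken every `swa_interval_steps`.
    def __init__(self, w0, ema_decay=0.997, swa_interval_steps=50):
        self.ema = dict(w0)
        self.swa = dict(w0)
        self.decay = ema_decay
        self.interval = swa_interval_steps
        self.step = 0
        self.n_swa = 1  # w0 counts as the first SWA snapshot

    def update(self, weights):
        self.step += 1
        for k, v in weights.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if self.step % self.interval == 0:
            self.n_swa += 1
            for k in self.swa:
                self.swa[k] += (self.ema[k] - self.swa[k]) / self.n_swa

avg = AveragedWeights({"w": 0.0})
for _ in range(200):
    avg.update({"w": 1.0})
```

With decay 0.997 the EMA has an effective horizon of roughly 1/(1-0.997) ≈ 333 steps, so the SWA of its snapshots lags further behind the raw weights.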
### Initialization
- **OrthoInit**: used for model initialization
### Regularization
- **LN scale** (`scale: 1/sqrt(layer+1)`)
- **magnitude pruning** (`sparsity: 3%`)
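A minimal sketch of magnitude pruning at the configured 3% sparsity: zero the 3% of weights with the smallest absolute value (whether this is applied per-tensor or globally is not stated in the config; per-tensor is assumed here).

```python
import numpy as np

def magnitude_prune(w, sparsity=0.03):
    # Zero the `sparsity` fraction of entries with the smallest |w|.
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0  # ties at the threshold are also zeroed
    return out

rng = np.random.default_rng(2)
w = rng.standard_normal((100, 10))
w_sparse = magnitude_prune(w)
```

At 3% the accuracy impact is typically negligible, while the extra zeros make the artifact more compressible under zstd.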
### Compression
- **zstd** (`level: 22`)
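Level 22 is zstd's maximum compression level and requires the `--ultra` flag on the command line. A sketch of equivalent CLI usage (the artifact filename is hypothetical):

```shell
# Compress the serialized model at maximum level (levels > 19 need --ultra).
zstd --ultra -22 -o model_artifact.zst model_artifact.bin

# Decompress when loading the artifact.
zstd -d -o model_artifact.bin model_artifact.zst
```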
## Novel Contributions
- INT5 GPTQ quantization with clip_range=15 to fit a larger model under the artifact limit
- XSA applied to all 11 layers
- Legal score-first chunked TTT, in which each token is scored before any gradient update that uses it
- Coprime-stride data loader without permutation arrays
- Wallclock-adaptive warmdown schedule
- Parallel Muon optimizer with overlapping communication and orthogonalization
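The coprime-stride loader contribution rests on a simple number-theoretic fact: stepping through indices with a stride coprime to the dataset size visits every index exactly once per epoch, so no O(N) permutation array is needed. A minimal sketch (function name is illustrative):

```python
import math

def coprime_stride_order(n, stride, start=0):
    # Visit (start + i * stride) % n for i = 0..n-1. Because
    # gcd(stride, n) == 1, this touches every index exactly once,
    # with no materialized permutation array.
    assert math.gcd(stride, n) == 1, "stride must be coprime with n"
    idx = start % n
    for _ in range(n):
        yield idx
        idx = (idx + stride) % n

order = list(coprime_stride_order(10, 3))
print(order)  # [0, 3, 6, 9, 2, 5, 8, 1, 4, 7]
```

Varying `start` (or `stride`) per epoch gives a different traversal order at zero memory cost, though the result is a fixed arithmetic progression rather than a uniformly random shuffle.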