val_bpb: 1.1099
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.5 MB
Training Techniques
Architecture
- XSA: XSA-all attention/sequence architecture variant (parameters: null)
- BigramHash: Bigram2048 token hashing/embedding component (parameters: {"dimensions": 2048}); see the sketch after this list
- RoPE: rotary positional embeddings with reduced dimension (parameters: {"dimensions": 16}); see the sketch after this list
Optimizer
- Parallel Muon (weight_decay: null, momentum: null, other_params: null); see the sketch below
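None of the optimizer hyperparameters are recorded, so this sketch shows the core Muon update from the public reference implementation: an SGD-momentum direction orthogonalized per weight matrix with a quintic Newton-Schulz iteration. The scheduling that makes it "Parallel" Muon (spreading the per-matrix orthogonalizations across devices) is not shown, and the lr/momentum values below are common defaults rather than the submission's settings.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate G -> U V^T (G's nearest semi-orthogonal matrix) with a
    quintic Newton-Schulz iteration; coefficients follow the public Muon
    reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.bfloat16()
    x = x / (x.norm() + 1e-7)          # scale so the iteration converges
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)

def muon_step(param: torch.Tensor, grad: torch.Tensor,
              momentum_buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon update for a single 2D weight matrix."""
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)
```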
Weight Averaging
- SWA (stochastic weight averaging; parameters: null); see the sketch below
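No averaging schedule is recorded, so this just shows the standard PyTorch route to stochastic weight averaging; the model/loader names and the swa_start/swa_every schedule are placeholders, not the submission's values.

```python
import torch
from torch.optim.swa_utils import AveragedModel

# model, optimizer, and train_loader are assumed to exist elsewhere.
swa_model = AveragedModel(model)   # maintains an equal-weight running average
swa_start, swa_every = 8000, 200   # placeholder schedule, not from the summary

for step, batch in enumerate(train_loader):
    loss = model(batch)             # assumed to return a scalar loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= swa_start and step % swa_every == 0:
        swa_model.update_parameters(model)   # fold current weights into the average

# swa_model.module holds the averaged weights for evaluation/export.
```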
Quantization
- Late QAT (quantization-aware training enabled late in the run): 6-bit, scoped to the embeddings and 5 layers; see the sketch below
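"Naive int6" reads as plain symmetric fake quantization with a straight-through estimator, applied to the embedding table and 5 layers near the end of training. A minimal sketch under that reading (the per-tensor scale and the STE are assumptions):

```python
import torch

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Naive symmetric 6-bit fake quantization with a straight-through
    estimator: the forward pass sees the quantized weights, while the
    backward pass treats the rounding as identity."""
    qmax = 2 ** (6 - 1) - 1                      # 31 for signed int6
    scale = w.abs().max() / qmax + 1e-12         # single per-tensor scale
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    w_q = q * scale
    return w + (w_q - w).detach()                # straight-through trick
```

During the late-QAT phase the affected weights would be routed through fake_quant_int6 on every forward pass, so the network adapts to the 6-bit grid before export.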
Compression
- zstd (level: null); see the sketch below
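The compression level is not recorded (level: null), so this sketch of compressing a serialized checkpoint with zstd picks an arbitrary high level; the filename and the use of the zstandard Python binding are assumptions.

```python
import io
import torch
import zstandard  # pip install zstandard

buf = io.BytesIO()
torch.save(model.state_dict(), buf)   # model assumed to exist
raw = buf.getvalue()

# level=19 is illustrative; the summary does not record the actual level.
compressed = zstandard.ZstdCompressor(level=19).compress(raw)
with open("submission.pt.zst", "wb") as f:
    f.write(compressed)
```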
Evaluation
- Sliding window eval (parameters: null); see the sketch below
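Window and stride are not recorded, so this is a generic sliding-window bits-per-byte evaluation: each window re-feeds overlapping context but scores only the tokens past the previous window's end, so every scored token sees substantial left context. The HF-style model(input_ids, labels=...) interface and the window/stride defaults are assumptions.

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, ids: torch.Tensor, n_bytes: int,
                       window: int = 1024, stride: int = 512) -> float:
    """Bits-per-byte over a long token sequence `ids` (shape 1 x seq_len).
    Assumes an HF-style model whose forward accepts `labels` and returns
    an object with a mean `.loss` (ignoring positions labeled -100)."""
    seq_len = ids.size(1)
    total_nll, prev_end = 0.0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        target_len = end - prev_end          # only score tokens not yet seen
        input_ids = ids[:, begin:end]
        labels = input_ids.clone()
        labels[:, :-target_len] = -100       # mask the overlap from the loss
        out = model(input_ids, labels=labels)
        total_nll += out.loss.item() * target_len  # de-average (approximate)
        prev_end = end
        if end == seq_len:
            break
    return total_nll / (n_bytes * math.log(2))  # nats per byte -> bits per byte
```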
Other
- Parallel Muon optimizer with coprime loader and GPU mixer prefill for fast startup (parameters: {"coprime_loader": true, "gpu_prefill": true}); see the sketch below
Novel Contributions
- XSA-all architecture variant
- Parallel Muon optimization
- Coprime loader
- Bigram2048 token-hashing embedding component
- RoPE16 positional embeddings (rotary dimension 16)
- Stochastic weight averaging (SWA)
- Late QAT with naive int6 quantization of the embeddings and 5 layers
- zstd-compressed submission