val_bpb: 1.1035
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,536,878 B
Training Techniques
Architecture
- XSA: 11-layer XSA-all architecture used as the base model (parameters: {"layers":11})
- weight tying: standard embedding/lm_head tying (parameters: null)
Optimizer
- Parallel Muon (weight_decay: null, momentum: null, other_params: null)
Quantization
- int6 (bits: 6, scope: naive)
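Since the scope is listed as "naive", the quantizer is presumably symmetric with a single per-tensor scale. A minimal sketch under that assumption (the function names and the per-tensor granularity are illustrative, not taken from the submission):

```python
def quantize_int6(weights):
    """Naive symmetric per-tensor quantization to 6-bit integers.

    A signed 6-bit value spans [-32, 31]; one scale maps the
    largest-magnitude weight onto 31, and every weight is rounded
    to the nearest representable step.
    """
    scale = max(abs(w) for w in weights) / 31 or 1.0  # avoid scale == 0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from the 6-bit codes."""
    return [v * scale for v in q]
```

The maximum round-trip error of this scheme is half a quantization step, i.e. scale / 2; packing four 6-bit codes into three bytes (not shown) would realize the storage saving.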
Compression
- brotli (level: 11)
Other
- byte-shuffle compression with stride=2 (parameters: {"stride":2})
- custom context-only SLOT test-time optimization (parameters: {"steps":8})
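Byte-shuffling before compression is a standard transform: with stride=2, the even and odd bytes of the serialized weights (e.g., the low and high bytes of 16-bit values) are grouped into two homogeneous runs that a general-purpose codec models better. A minimal sketch, using stdlib zlib as a stand-in for the submission's brotli at quality 11:

```python
import zlib  # stand-in codec; the submission pairs this transform with brotli-11

def byte_shuffle(data: bytes, stride: int = 2) -> bytes:
    """Group every stride-th byte together so same-position bytes of
    each stride-byte group form contiguous runs."""
    return b"".join(data[i::stride] for i in range(stride))

def byte_unshuffle(data: bytes, stride: int = 2) -> bytes:
    """Invert byte_shuffle, handling lengths not divisible by stride."""
    n, rem = divmod(len(data), stride)
    lanes, pos = [], 0
    for i in range(stride):
        size = n + (1 if i < rem else 0)  # lane i holds bytes i, i+stride, ...
        lanes.append(data[pos:pos + size])
        pos += size
    out = bytearray()
    for j in range(len(lanes[0])):
        for lane in lanes:
            if j < len(lane):
                out.append(lane[j])
    return bytes(out)

# round-trip: shuffle -> compress -> decompress -> unshuffle
payload = bytes(range(256)) * 8
packed = zlib.compress(byte_shuffle(payload, 2), 9)
assert byte_unshuffle(zlib.decompress(packed), 2) == payload
```

The transform is lossless and order-reversible, so it changes only the compressed size, never the decoded weights.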
Evaluation
- sliding window eval (parameters: null)
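The window and stride are not recorded (parameters: null). The usual sliding-window scheme scores every token exactly once while giving tokens after the first window overlapping context; a sketch of just the index bookkeeping, with window and stride as assumed parameters:

```python
def sliding_windows(seq_len: int, window: int, stride: int):
    """Yield (start, end, score_from) triples: run the model on tokens
    [start, end) but accumulate loss only over [score_from, end), so
    each token is scored once and tokens beyond the first window keep
    at least window - stride tokens of prior context."""
    scored = 0  # index of the first not-yet-scored token
    for start in range(0, seq_len, stride):
        end = min(start + window, seq_len)
        yield start, end, scored
        scored = end
        if end == seq_len:
            break
```

For example, seq_len=10, window=4, stride=2 yields (0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8): the scored spans tile [0, 10) with no gaps or overlaps.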
LR Schedule
- warmdown (parameters: {"warmdown_steps":3500})
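A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final warmdown_steps steps. Only warmdown_steps=3500 comes from the submission; total_steps and base_lr below are illustrative placeholders:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay to zero over the last
    warmdown_steps steps (the schedule the submission calls warmdown)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1.0 -> 0.0 over the tail
    return base_lr * frac
```

With total_steps=10000 and base_lr=0.02, the LR stays at 0.02 until step 6500, reaches 0.01 at step 8250, and hits 0.0 at step 10000.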
Novel Contributions
- brotli-11 compression with byte-shuffle stride=2 to reduce model size
- custom context-only SLOT 8-step test-time optimization
- 11-layer XSA-all Rascal II training setup with parallel Muon and coprime loader
- naive int6 quantization