val_bpb: 1.1211
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.95 MB
Training Techniques
Architecture
XSA
Cross-sequence attention applied to all 11 layers rather than only the last 4.
parameters: {"layers":11}
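The card does not define XSA's mechanics beyond the name; below is a minimal sketch under the assumption that cross-sequence attention lets each query attend to keys/values pooled from every sequence in the batch. The module and dimension names are illustrative, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossSequenceAttention(nn.Module):
    # Hypothetical reading of "XSA": keys/values are pooled across the
    # whole batch, so tokens can attend across sequence boundaries.
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        H, Dh = self.n_heads, C // self.n_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, H, Dh).transpose(1, 2)              # (B, H, T, Dh)
        # Merge the batch axis into the key/value length so every query
        # sees all B*T tokens (causal masking omitted for brevity).
        k = k.reshape(1, B * T, H, Dh).transpose(1, 2).expand(B, -1, -1, -1)
        v = v.reshape(1, B * T, H, Dh).transpose(1, 2).expand(B, -1, -1, -1)
        out = F.scaled_dot_product_attention(q, k, v)        # (B, H, T, Dh)
        return self.proj(out.transpose(1, 2).reshape(B, T, C))

# Used in every one of the 11 blocks, not only the last 4.
```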
Hyper-connections
Manifold-constrained hyper-connections with learnable alpha/beta residual mixing per block under a norm constraint.
parameters: {"extra_params":22}
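A sketch of one plausible form of this, assuming the per-block alpha/beta pair is renormalized to unit L2 norm (the "manifold") on each forward pass; with 11 blocks at 2 scalars each, this accounts for the 22 extra parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ManifoldResidualMix(nn.Module):
    # One alpha/beta pair per block; 11 blocks x 2 scalars = 22 extra
    # parameters. Renormalizing to unit L2 norm keeps (alpha, beta) on
    # the unit circle -- an assumed reading of the "norm constraint".
    def __init__(self):
        super().__init__()
        self.ab = nn.Parameter(torch.tensor([1.0, 1.0]))

    def forward(self, residual: torch.Tensor, branch: torch.Tensor) -> torch.Tensor:
        alpha, beta = F.normalize(self.ab, dim=0)
        return alpha * residual + beta * branch

# In each block, x = mix(x, block_fn(x)) replaces the plain x + block_fn(x).
```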
Quantization
QAT
parameters: {"bits":6,"scope":"all"}
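A minimal sketch of 6-bit quantization-aware training via fake quantization with a straight-through estimator; the symmetric per-tensor scheme is an assumption, while only the bit width and the all-weights scope come from the card.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-tensor fake quantization (assumed scheme). Forward
    # uses the 6-bit rounded values; the straight-through estimator
    # passes gradients to the full-precision weights unchanged.
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()

# Applied to every weight tensor ("scope: all") from the first step,
# e.g. y = x @ fake_quant(layer.weight).t() in each linear layer.
```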
Evaluation
sliding window eval
parameters: {"stride":64}
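A sketch of sliding-window evaluation with the card's stride of 64: each step scores only the newly exposed tokens while conditioning on a longer context window. The window size of 1024 and reporting bits per token rather than bits per byte are simplifying assumptions.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bits(model, tokens: torch.Tensor,
                        window: int = 1024, stride: int = 64) -> float:
    # Advance by `stride` tokens per step, but condition each step on up
    # to `window` tokens of context; score only the new positions so
    # every token (after the first window) gets long context.
    nll, counted = 0.0, 0
    for start in range(0, tokens.size(0) - 1, stride):
        end = min(start + stride, tokens.size(0) - 1)
        begin = max(0, end - window)
        ctx = tokens[begin : end + 1].unsqueeze(0)     # (1, <=window+1)
        logits = model(ctx[:, :-1])                    # (1, L, vocab)
        n_new = end - start                            # targets not yet scored
        nll += F.cross_entropy(
            logits[0, -n_new:], ctx[0, -n_new:], reduction="sum"
        ).item()
        counted += n_new
    return nll / counted / math.log(2)                 # nats -> bits per token
```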
Test-Time Training
full TTT
parameters: {"enabled":1}
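"Full TTT" is not specified beyond the enabled flag; here is a sketch of one common form, where a copy of the model takes a few language-modeling gradient steps on the evaluation context before scoring it. The step count and learning rate are illustrative, not from the card.

```python
import copy
import torch
import torch.nn.functional as F

def test_time_train(model, context: torch.Tensor,
                    steps: int = 4, lr: float = 1e-4):
    # Adapt a throwaway copy on the eval context itself; the original
    # weights are untouched, so each document starts from the same model.
    tuned = copy.deepcopy(model).train()
    opt = torch.optim.SGD(tuned.parameters(), lr=lr)
    x, y = context[:-1].unsqueeze(0), context[1:].unsqueeze(0)
    for _ in range(steps):
        loss = F.cross_entropy(tuned(x).flatten(0, 1), y.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tuned.eval()
```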
Novel Contributions
- XSA applied to all 11 layers
- Manifold-constrained hyper-connections with 22 extra parameters
- 6-bit QAT applied for the full training run, from step 1
- Parallel Muon optimizer stack
- Sliding-window evaluation (stride 64) and a rule-compliant ("legal") full-TTT improvement