| Metric | Value |
| --- | --- |
| val_bpb | 1.1354 |
| Architecture | 11-layer Transformer |
| Optimizer | Muon + AdamW |
| Artifact size | 15.85 MB |
## Training Techniques
### Quantization
- **int6** (bits: 6, scope: all)

### Compression
- **zstd** (level: 22)
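A minimal sketch of how the int6 step might work. This is an illustration, not the submission's code: the per-tensor symmetric scale and the [-31, 31] range are assumptions. The quantized integers would then be bit-packed and compressed with zstd at level 22, which is not shown here.

```python
# Hypothetical symmetric int6 quantization: map floats to integers in
# [-31, 31] with a single per-tensor scale, dequantize by scaling back.

def quantize_int6(weights):
    """Quantize a list of floats to 6-bit signed integers plus a scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31  # assume symmetric range [-31, 31]
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.25, 0.03, 2.0]
q, scale = quantize_int6(weights)
restored = dequantize_int6(q, scale)
```

Round-trip error is bounded by half the quantization step, which is what makes a 6-bit representation viable for weights with a modest dynamic range.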
### Architecture
- **XSA**: Partial exclusive self-attention applied only to the last 3 layers to debias self-attention efficiently in a GQA-aware way. Parameters: `{"layers": 3}`
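The card does not define XSA precisely; one plausible reading of "exclusive self-attention" is a causal mask in which each position is excluded from attending to itself. A toy sketch under that assumption (mask construction only, no GQA plumbing):

```python
# Hypothetical "exclusive" causal mask: queries attend to strictly earlier
# positions, so the diagonal is masked out. Per the card, only the last 3
# layers would use this; other layers keep the standard causal mask.

def causal_mask(n, exclusive=False):
    """mask[i][j] is True where query i may attend to key j."""
    return [[(j < i) if exclusive else (j <= i) for j in range(n)]
            for i in range(n)]

standard = causal_mask(4)             # diagonal allowed
xsa = causal_mask(4, exclusive=True)  # diagonal masked out
```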
- **RoPE**: Extended positional encoding using a larger RoPE base. Parameters: `{"base": 50000}`
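The effect of the larger base can be seen directly in the RoPE inverse frequencies: raising the base from the common default of 10000 to 50000 slows the rotation of the higher-index frequency pairs, a standard way to stretch the usable context. The head dimension of 64 is an assumed value, not stated in the card.

```python
# RoPE inverse frequencies: inv_freq[i] = base ** (-2i / head_dim).
# A larger base yields slower-rotating high-index pairs.
def rope_inv_freq(head_dim, base):
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

default = rope_inv_freq(64, 10000)   # common default base
extended = rope_inv_freq(64, 50000)  # base from the card
```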
- **SmearGate**: Custom gating mechanism used in the base architecture. No parameters.
- **BigramHash**: Bigram hashing with 2048 buckets used in the base architecture. Parameters: `{"buckets": 2048}`
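A sketch of bigram hashing into 2048 buckets. The exact hash function is not specified in the card; the multiplicative mix below is an illustrative stand-in. Each adjacent token pair maps to one of `buckets` rows (e.g. of an auxiliary embedding table).

```python
# Hypothetical bigram hash: mix the two token ids, reduce mod bucket count.
def bigram_bucket(tok_a, tok_b, buckets=2048):
    h = (tok_a * 1000003 + tok_b) & 0xFFFFFFFF  # illustrative mixing constant
    return h % buckets

def bigram_features(token_ids, buckets=2048):
    # one bucket id per adjacent token pair
    return [bigram_bucket(a, b, buckets)
            for a, b in zip(token_ids, token_ids[1:])]

ids = [5, 17, 17, 902]
feats = bigram_features(ids)
```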
### Test-Time Training
- **Full TTT**: Parameters: `{"epochs": 3, "learning_rate": 0.002, "freeze_blocks": 2}`
### Optimizer
- **Muon**: weight_decay: 0.04, momentum: 0.99; other params: `{"momentum_warmup_start": 0.92, "momentum_warmup_steps": 1500}`
- **AdamW**: weight_decay: 0.04; other params: `{"learning_rate": 0.025}`
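The Muon `other_params` imply a momentum warmup; a linear ramp is the simplest reading, sketched below. Whether the ramp is linear is an assumption, but the endpoints (0.92 to 0.99 over 1500 steps) come from the card.

```python
# Hypothetical linear momentum warmup for Muon: ramp from 0.92 to the
# final momentum of 0.99 over the first 1500 steps, then hold.
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    frac = min(1.0, step / warmup_steps)
    return start + (final - start) * frac

m0 = muon_momentum(0)        # 0.92 at the first step
m_mid = muon_momentum(750)   # halfway through warmup
m_end = muon_momentum(3000)  # capped at the final momentum
```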
### Weight Averaging
- **SWA**: Parameters: `{"checkpoints_averaged": 7}`
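SWA here amounts to a uniform average of each parameter across the 7 saved checkpoints. A minimal sketch, using plain dicts of float lists in place of real state dicts:

```python
# Uniform stochastic weight averaging over checkpoint state dicts.
def average_checkpoints(checkpoints):
    """checkpoints: list of {param_name: list_of_floats} dicts."""
    n = len(checkpoints)
    return {name: [sum(ck[name][i] for ck in checkpoints) / n
                   for i in range(len(checkpoints[0][name]))]
            for name in checkpoints[0]}

ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
avg = average_checkpoints(ckpts)  # {"w": [2.0, 3.0]}
```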
### Evaluation
- **Sliding-window eval**: Parameters: `{"stride": 64}`
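Sliding-window evaluation with a short stride scores each token with near-maximal left context: windows advance by 64 tokens, and only the tokens not covered by the previous window contribute to the loss. A sketch of the window bookkeeping (the window length of 2048 matches the card's eval_length; the exact scheme is an assumption):

```python
# Generate (start, end, score_from) spans for sliding-window evaluation.
# Tokens in [score_from, end) are the ones newly scored by each window,
# so every token is scored exactly once.
def sliding_windows(seq_len, window=2048, stride=64):
    spans = []
    start = 0
    while start + window < seq_len:
        end = start + window
        score_from = start if start == 0 else end - stride
        spans.append((start, end, score_from))
        start += stride
    # final window covers the remaining tail
    score_from = 0 if not spans else spans[-1][1]
    spans.append((max(0, seq_len - window), seq_len, score_from))
    return spans

spans = sliding_windows(4096)
```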
### Sequence Length
- train_length: 2048
- eval_length: 2048
### LR Schedule
- **Warmdown**: Parameters: `{"warmdown_iters": 3000, "warmup_steps": 1500}`
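A sketch of the trapezoidal schedule these parameters suggest: linear warmup over the first 1500 steps, constant in the middle, then a linear warmdown to zero over the final 3000 iterations. `total_iters` is an assumed value, and the peak LR of 0.025 is borrowed from AdamW's `learning_rate` in the card purely for illustration.

```python
# Hypothetical warmup / constant / warmdown learning-rate schedule.
def lr_at(step, total_iters, peak_lr=0.025,
          warmup_steps=1500, warmdown_iters=3000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warmup
    if step > total_iters - warmdown_iters:
        remaining = total_iters - step
        return peak_lr * remaining / warmdown_iters  # linear warmdown
    return peak_lr                                  # constant plateau

lr_start = lr_at(0, 10000)
lr_mid = lr_at(5000, 10000)
lr_end = lr_at(10000, 10000)
```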
### Initialization
- **OrthoInit**: Orthogonal initialization used in the base architecture.
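A sketch of the standard orthogonal-initialization construction (frameworks ship equivalents such as `torch.nn.init.orthogonal_`; the card does not say which implementation was used): draw a Gaussian matrix, take the Q factor of its QR decomposition, and apply a sign correction so the result is uniformly distributed over orthogonal matrices.

```python
import numpy as np

def orthogonal_init(rows, cols, rng=None):
    """Return a rows x cols matrix with orthonormal columns."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    # flip column signs by sign(diag(R)) so the distribution is uniform
    q *= np.sign(np.diag(r))
    return q

w = orthogonal_init(8, 8)  # w.T @ w is (numerically) the identity
```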
## Novel Contributions
- Partial XSA applied to the last 3 layers
- Test-time training with 3-epoch full-model SGD and early-block freezing
- Batch size optimized to 524K tokens for more gradient updates
- RoPE base increased to 50K
- Sliding-window evaluation with stride 64
- Int6 quantization with zstd level-22 compression to fit under the 16 MB limit