val_bpb: 1.0963
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 15,532,043 B
Training Techniques
Architecture: DeltaNet
8 Gated DeltaNet (gated linear attention) layers plus a final standard attention layer.
parameters: {"layers": 8, "final_attention_layer": 1, "n_embd": 384}
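The core of a DeltaNet layer is the delta-rule fast-weight update, which overwrites the value previously stored under a key rather than accumulating into it. A minimal sketch of that recurrence, assuming the standard formulation S_t = S_{t-1} + beta * (v_t - S_{t-1} k_t) k_t^T; the function and variable names are illustrative, not the submission's code:

```python
import numpy as np

def deltanet_step(S, q, k, v, beta):
    """One delta-rule update plus a read.

    S: (d, d) fast-weight state; q, k, v: (d,) query/key/value; beta: write strength.
    The update replaces the value currently stored under key k (i.e. S @ k) with v.
    """
    k = k / (np.linalg.norm(k) + 1e-8)        # unit-norm key
    S = S + beta * np.outer(v - S @ k, k)     # delta rule: erase old value, write new one
    return S, S @ q                           # output is a read with query q

# toy usage: stream a few tokens through one head of width 8
d = 8
rng = np.random.default_rng(0)
S = np.zeros((d, d))
for _ in range(16):
    q, k, v = rng.standard_normal((3, d))
    S, o = deltanet_step(S, q, k, v, beta=0.5)
```

With beta = 1 and a fresh state, writing (k, v) and then reading with q = k returns exactly v, which is the "overwrite" behavior that distinguishes the delta rule from plain linear-attention accumulation. Gating (the "Gated" in Gated DeltaNet) adds a learned decay on S that this sketch omits.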
Weight tying: standard embedding/lm_head tying.
parameters: null
Evaluation: sliding-window evaluation
parameters: null
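Sliding-window evaluation scores the data in overlapping windows so every byte is conditioned on a long left context, instead of resetting the context at chunk boundaries. A minimal sketch, assuming only the last `stride` positions of each window are scored; the window/stride values and the toy uniform model are illustrative, not the submission's settings:

```python
import numpy as np

def sliding_window_bpb(logprob_fn, data, window=64, stride=16):
    """Bits-per-byte with overlapping windows.

    logprob_fn(chunk) -> per-position log2 probability of the true byte at each
    position of `chunk`. Only the final `stride` positions of each window are
    scored, so each scored byte sees up to `window - 1` bytes of context.
    """
    total_bits, scored, pos = 0.0, 0, 0
    while pos < len(data):
        start = max(0, pos + stride - window)
        chunk = data[start:pos + stride]
        lp = logprob_fn(chunk)                  # shape (len(chunk),)
        new = min(stride, len(data) - pos)      # bytes not yet scored
        total_bits += -lp[-new:].sum()
        scored += new
        pos += stride
    return total_bits / scored

# toy "model": uniform over 256 byte values, so bpb is exactly 8.0
uniform = lambda chunk: np.full(len(chunk), np.log2(1 / 256))
data = np.frombuffer(b"hello sliding window", dtype=np.uint8)
print(sliding_window_bpb(uniform, data))  # 8.0
```

The trade-off is cost: each byte is re-processed roughly `window / stride` times, in exchange for a bpb number that reflects full-context prediction.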
Test-Time Training: Context-Only SLOT
parameters: {"steps": 24}
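Context-Only SLOT fits a single shared delta (a (1, 1, dim) tensor broadcast over all positions) by gradient descent on the next-token loss of the context itself, leaving the model weights frozen. A minimal sketch with a frozen linear head and a manually derived gradient, assuming the delta is added to the hidden states; the shapes, learning rate, and names are illustrative, not the submission's code:

```python
import numpy as np

def slot_adapt(H, W, targets, steps=24, lr=0.02):
    """Context-Only SLOT: learn one shared delta on the prompt, nothing else.

    H: (T, d) frozen hidden states of the context, W: (d, V) frozen output head,
    targets: (T,) next-token ids. Returns delta of shape (1, d), broadcast over
    positions like a (1, 1, dim) tensor in a batched model.
    """
    delta = np.zeros((1, H.shape[1]))
    for _ in range(steps):
        logits = (H + delta) @ W                        # delta shared across all T positions
        logits -= logits.max(axis=1, keepdims=True)     # stable softmax
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(targets)), targets] -= 1.0      # dL/dlogits for mean cross-entropy
        grad = (p @ W.T).mean(axis=0, keepdims=True)    # chain rule into the shared delta
        delta -= lr * grad
    return delta
```

Because the loss is computed only on context tokens, adaptation needs no labels beyond the prompt; at generation time the learned delta is simply added to every position's hidden state.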
Optimizer: AdamW
weight_decay: null
momentum: null
other_params: {"fused": true}
Other: Brotli byte-shuffle used as part of the submission pipeline.
parameters: null
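Byte-shuffle transposes a multi-byte weight buffer so that byte k of every element is stored contiguously; since the high (sign/exponent) bytes of small weights are highly repetitive, the shuffled stream compresses better. A minimal sketch of the transform, assuming a blosc-style shuffle; `zlib` here is only a stdlib stand-in for the Brotli compressor the pipeline actually uses, and all names are illustrative:

```python
import numpy as np
import zlib

def byte_shuffle(buf: bytes, itemsize: int) -> bytes:
    """Group byte k of every item together (blosc-style shuffle)."""
    a = np.frombuffer(buf, dtype=np.uint8).reshape(-1, itemsize)
    return a.T.tobytes()

def byte_unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle: restore the original item-major byte order."""
    a = np.frombuffer(buf, dtype=np.uint8).reshape(itemsize, -1)
    return a.T.tobytes()

# toy weight tensor: small float16 values, as in a trained model
weights = (np.random.default_rng(0).standard_normal(4096) * 0.02).astype(np.float16)
raw = weights.tobytes()
shuffled = byte_shuffle(raw, weights.itemsize)
assert byte_unshuffle(shuffled, weights.itemsize) == raw  # the transform is lossless
print(len(zlib.compress(raw, 9)), len(zlib.compress(shuffled, 9)))
```

The shuffle itself adds no information and is exactly invertible, so decompression is shuffle-aware: decompress first, then unshuffle before reinterpreting the bytes as weights.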
Novel Contributions
- Rascal II combined with Brotli byte-shuffle
- Custom 24-step Context-Only SLOT test-time adaptation at inference
- Shared delta of shape (1, 1, dim) for SLOT adaptation
- Sliding-window evaluation
- 8-layer DeltaNet hybrid architecture with a final attention layer