PR #1286

open

Lucky IV — 1.09626897 val_bpb (seed 444)

by newjordan
val_bpb: 1.0963
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 15,532,043 B

Training Techniques

Architecture
DeltaNet
8 layers of Gated Linear Attention DeltaNet plus a final standard attention layer
parameters: {"layers":8,"final_attention_layer":1,"n_embd":384}
weight tying
Standard embedding/lm_head tying
parameters: null
Evaluation
sliding window eval
parameters: null
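
The PR lists sliding window evaluation without parameters, so the following is only a minimal sketch of the general technique, not the submission's code. The function names and the `window` and `stride` arguments are hypothetical. Each window re-reads earlier tokens as context and scores only the tokens not yet scored, so every token is counted exactly once. The summed loss is then converted from nats to bits per byte.

```python
import math


def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) spans covering n_tokens.

    Tokens in [score_from, end) are scored. Tokens in [start, score_from)
    are fed to the model only as context. window and stride are
    hypothetical values; the PR does not state them.
    """
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    scored = 0  # number of tokens scored so far
    while scored < n_tokens:
        if scored == 0:
            end = min(window, n_tokens)           # first window scores everything in it
        else:
            end = min(scored + stride, n_tokens)  # later windows score `stride` new tokens
        start = max(0, end - window)
        yield start, end, scored
        scored = end


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


if __name__ == "__main__":
    for span in sliding_windows(n_tokens=10, window=4, stride=2):
        print(span)
```

A smaller stride gives each scored token more context at the cost of more forward passes.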
Test-Time Training
Context-Only SLOT
parameters: {"steps":24}
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"fused":true}
Other
other
Brotli byte-shuffle used as part of the submission pipeline
parameters: null
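
The PR does not describe how the byte-shuffle is implemented, so the sketch below shows the generic transform under stated assumptions. A byte-shuffle regroups fixed-width values so that all first bytes come first, then all second bytes, and so on, before the result is passed to the compressor. The function names and `itemsize=2` (for example, 16-bit weights) are hypothetical. If the third-party `brotli` package is not installed, the sketch falls back to `zlib` purely as a stand-in.

```python
import zlib

try:
    import brotli  # third-party package; optional in this sketch
    compress, decompress = brotli.compress, brotli.decompress
except ImportError:
    # Stand-in compressor so the sketch still runs; NOT what the PR uses.
    compress, decompress = zlib.compress, zlib.decompress


def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    """Group the k-th byte of every itemsize-wide element together."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    return b"".join(data[k::itemsize] for k in range(itemsize))


def byte_unshuffle(data: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle."""
    if len(data) % itemsize:
        raise ValueError("data length must be a multiple of itemsize")
    n = len(data) // itemsize
    out = bytearray(len(data))
    for k in range(itemsize):
        out[k::itemsize] = data[k * n:(k + 1) * n]
    return bytes(out)


def pack(raw: bytes, itemsize: int = 2) -> bytes:
    """Shuffle, then compress. itemsize=2 is an assumed element width."""
    return compress(byte_shuffle(raw, itemsize))


def unpack(blob: bytes, itemsize: int = 2) -> bytes:
    """Decompress, then unshuffle."""
    return byte_unshuffle(decompress(blob), itemsize)


if __name__ == "__main__":
    raw = bytes(range(64))
    assert unpack(pack(raw)) == raw
```

Whether the shuffle actually shrinks the artifact depends on the data; the sketch only demonstrates that the transform is lossless.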

Novel Contributions

  • Rascal II combined with brotli byte-shuffle
  • Custom 24-step Context-Only SLOT test-time adaptation at inference
  • Shared delta (1,1,dim) for SLOT adaptation
  • Sliding window evaluation
  • 8-layer DeltaNet hybrid architecture with a final attention layer
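
The SLOT items above can be illustrated with a small NumPy sketch. It is written under several assumptions and is not the PR's implementation. It assumes a single delta of shape (1, 1, dim) is added to the final hidden states before the output projection. It also assumes "context-only" means the delta is fit only on context tokens, so the targets being scored never enter the adaptation loss. The names `fit_delta` and `context_nll`, the learning rate, and plain gradient descent are all hypothetical choices; the PR does not say which optimizer drives the 24 steps.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def context_nll(hidden, w_out, targets, delta):
    """Mean next-token NLL in nats with delta added to the hidden states.

    hidden: (B, T, dim) final hidden states at context positions
    w_out:  (vocab, dim) output projection (tied lm_head in the PR)
    targets: (B, T) integer token ids
    delta:  (1, 1, dim), broadcast over batch and sequence
    """
    logp = log_softmax((hidden + delta) @ w_out.T)
    b, t = np.indices(targets.shape)
    return -logp[b, t, targets].mean()


def fit_delta(hidden, w_out, targets, steps=24, lr=0.1):
    """Fit one shared delta on context tokens; model weights stay frozen.

    steps=24 follows the PR; lr and the plain gradient-descent update
    are assumptions made for this sketch.
    """
    delta = np.zeros((1, 1, hidden.shape[-1]))
    b, t = np.indices(targets.shape)
    for _ in range(steps):
        grad_logits = np.exp(log_softmax((hidden + delta) @ w_out.T))
        grad_logits[b, t, targets] -= 1.0  # softmax minus one-hot
        # Delta is shared, so its gradient sums over batch and sequence.
        grad = (grad_logits @ w_out).sum(axis=(0, 1), keepdims=True)
        delta -= lr * grad / targets.size
    return delta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(2, 5, 8))      # toy sizes, not the PR's n_embd
    w_out = 0.3 * rng.normal(size=(11, 8))
    targets = rng.integers(0, 11, size=(2, 5))
    delta = fit_delta(hidden, w_out, targets)
    print(context_nll(hidden, w_out, targets, np.zeros_like(delta)))
    print(context_nll(hidden, w_out, targets, delta))
```

Because the delta has only `dim` entries and the model weights are untouched, each adaptation step costs one output projection rather than a full backward pass through the network.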