PR #1768 (open)

Add non-record 16MB SP1024 ShareVLast3 3-seed submission

val_bpb: 1.2792
Architecture: Transformer
Optimizer:
Artifact Size: 15973626 bytes (≈15.97 MB)

Training Techniques

Architecture
  • XSA: applied on the last 4 layers
    parameters: {"layers":4}
  • U-Net skip connections: structured skip fusion with BIFPN2 mode enabled (sketched below)
    parameters: {"BIFPN2_MODE":1}
  • Weight tying: shared V across the last 3 layers (sketched below)
    parameters: {"layers":3}
  • BigramHash: 2-gram scaffold with fade-out (sketched below)
    parameters: {"max_n":2,"fade_enable":1}

Sequence Length
  • train_length: null
  • eval_length: null
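
The BIFPN2 skip-fusion mode is not spelled out in this PR, so the following is only a plausible reading: a BiFPN-style fast normalized fusion that blends a late layer's hidden state with the U-Net skip tensor cached by its mirrored early layer. The class name WeightedSkipFusion and the two-stream weighting are illustrative assumptions, not the submission's code.

```python
# Hypothetical sketch of structured skip fusion in the spirit of BiFPN's
# "fast normalized fusion"; the real BIFPN2_MODE logic is not shown here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedSkipFusion(nn.Module):
    """Fuse a late-layer hidden state with a cached U-Net skip tensor."""

    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))  # one learnable weight per stream
        self.eps = eps

    def forward(self, hidden: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        w = F.relu(self.w)            # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)  # normalize so the weights sum to ~1
        return w[0] * hidden + w[1] * skip
```

As in U-Net, a layer in the second half of the stack would fuse its input with the activation saved by its mirror in the first half before running attention.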
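
A single shared V projection for the last 3 layers is the most direct way to read the ShareVLast3 weight tying, and it also cuts parameters toward the 16MB artifact budget. The sketch below is a minimal PyTorch illustration under that assumption; AttentionWithSharedV and build_attention_stack are hypothetical names, not the submission's code.

```python
# Hypothetical sketch of "Shared V across the last 3 layers": the last
# `share_last` attention layers all reference one value-projection module,
# so its weights are stored and trained only once.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithSharedV(nn.Module):
    def __init__(self, dim: int, n_heads: int, shared_v: Optional[nn.Linear] = None):
        super().__init__()
        self.n_heads = n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        # Reuse the shared V projection when one is provided.
        self.v_proj = shared_v if shared_v is not None else nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.n_heads
        q = self.q_proj(x).view(b, t, h, d // h).transpose(1, 2)
        k = self.k_proj(x).view(b, t, h, d // h).transpose(1, 2)
        v = self.v_proj(x).view(b, t, h, d // h).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(b, t, d))

def build_attention_stack(n_layers: int, dim: int, n_heads: int, share_last: int = 3):
    """The last `share_last` layers share one V projection; earlier layers own theirs."""
    shared_v = nn.Linear(dim, dim, bias=False)
    return nn.ModuleList(
        AttentionWithSharedV(dim, n_heads, shared_v if i >= n_layers - share_last else None)
        for i in range(n_layers)
    )
```

Gradients from all three tied layers accumulate into the single shared weight matrix.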
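
The max_n=2 and fade_enable=1 parameters suggest a hashed bigram embedding that is blended into the input early in training and then annealed away. The table size, hash constant, and linear schedule below are illustrative assumptions, not values taken from the submission.

```python
# Hypothetical sketch of a 2-gram scaffold with fade-out: each (previous,
# current) token pair is hashed into a fixed-size embedding table, and the
# scaffold's contribution is scaled by a coefficient that decays to zero.
import torch
import torch.nn as nn

class BigramHashScaffold(nn.Module):
    def __init__(self, dim: int, table_size: int = 1 << 16):
        super().__init__()
        self.table_size = table_size
        self.bigram_emb = nn.Embedding(table_size, dim)
        nn.init.zeros_(self.bigram_emb.weight)  # start as a no-op

    def forward(self, tokens: torch.Tensor, fade: float) -> torch.Tensor:
        # tokens: (batch, seq) integer token ids
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no left context at the first position
        idx = (prev * 1000003 + tokens) % self.table_size  # cheap pair hash
        return fade * self.bigram_emb(idx)

def fade_coefficient(step: int, fade_start: int, fade_end: int) -> float:
    """Linear fade from 1.0 to 0.0 between fade_start and fade_end steps."""
    if step <= fade_start:
        return 1.0
    if step >= fade_end:
        return 0.0
    return 1.0 - (step - fade_start) / (fade_end - fade_start)
```

A typical call site would add scaffold(tokens, fade_coefficient(step, fade_start, fade_end)) to the token embeddings before the first block, with the schedule endpoints chosen so the scaffold's contribution is fully gone well before training ends.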

Novel Contributions

  • Stable non-record 16MB submission under the artifact limit
  • Official SP1024 tokenizer and FineWeb SP1024 dataset
  • Structured skip fusion with BIFPN2 mode
  • XSA on the last 4 layers
  • 2-gram scaffold with fade-out
  • Shared V across the last 3 layers
  • 3-seed submission with representative seed 2027