PR #1120

open

val_bpb 1.1099 (3-seed mean) Rascal

by newjordan
val_bpb: 1.1099
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.5MB

Training Techniques

Architecture
XSA
XSA-all attention/sequence architecture variant
parameters: null
BigramHash
Bigram2048 token hashing/embedding component
parameters: {"dimensions":2048}
RoPE
Rotary positional embeddings with reduced dimension
parameters: {"dimensions":16}
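The RoPE entry above lists a reduced rotary dimension of 16. One common reading is that only the first 16 entries of each attention head vector are rotated and the rest pass through unchanged. The sketch below shows that reading in plain Python. The function name, the interleaved pairing convention, and the base of 10000 are assumptions for illustration and are not taken from the PR.

```python
import math


def apply_partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector by position.

    Entries beyond `rotary_dims` are returned unchanged. Adjacent entries
    (0,1), (2,3), ... are treated as 2-D pairs (an assumed convention).
    """
    if rotary_dims % 2 != 0 or rotary_dims > len(vec):
        raise ValueError("rotary_dims must be even and no larger than len(vec)")
    out = list(vec)
    for i in range(rotary_dims // 2):
        # Each pair gets its own frequency, decreasing with the pair index.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


if __name__ == "__main__":
    head = [float(i + 1) for i in range(64)]  # hypothetical 64-dim head
    rotated = apply_partial_rope(head, position=5)
    print(rotated[16:] == head[16:])  # entries past the rotary dims are untouched
```

Because each pair is only rotated, the vector norm is preserved, and position 0 leaves the vector unchanged.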
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
SWA
parameters: null
Quantization
late QAT
bits: 6
scope: embeddings and 5 layers
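The quantization entry above gives only the bit width (6) and the scope. As a rough illustration of what 6-bit fake quantization can look like, the sketch below rounds one weight row to symmetric signed levels in [-31, 31] with a per-row scale, and gates it on a late training step. The per-row scaling, the function names, and the 0.9 start fraction are assumptions, not details from the PR. A real QAT setup would also need a straight-through gradient, which is not shown.

```python
def fake_quant_int6(row):
    """Quantize then dequantize one weight row using 6-bit symmetric levels."""
    qmax = 31  # 2**(6 - 1) - 1, symmetric range [-31, 31]
    amax = max(abs(w) for w in row)
    if amax == 0.0:
        return list(row)  # nothing to scale
    scale = amax / qmax
    # Round to the nearest level, clamp, and map back to floats.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def qat_active(step, total_steps, start_fraction=0.9):
    """Return True once training reaches the (assumed) late-QAT start point."""
    return step >= int(start_fraction * total_steps)


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0, 0.013]
    if qat_active(step=950, total_steps=1000):
        weights = fake_quant_int6(weights)
    print(weights)
```

With this scheme the rounding error per weight is at most half a quantization step, that is, half of `amax / 31` for the row.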
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: null
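The evaluation entry above names sliding window eval without parameters. A typical version scores each token exactly once while giving it as much left context as the window allows: windows advance by a stride, and only the tokens not yet scored are counted in each window. The sketch below generates those windows. The window and stride values are hypothetical, and the PR may use a different scheme.

```python
def sliding_windows(num_tokens, window=1024, stride=256):
    """Yield (start, end, score_from) triples.

    The model sees tokens [start, end) and only tokens [score_from, end)
    contribute to the loss, so every token is scored exactly once.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        yield start, end, scored_upto
        scored_upto = end
        start += stride


if __name__ == "__main__":
    for start, end, score_from in sliding_windows(2000):
        print(start, end, score_from)
```

A smaller stride gives later tokens more context at the cost of more forward passes.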
Other
Parallel Muon optimizer with coprime loader and GPU mixer prefill for fast startup
parameters: {"coprime_loader":true,"gpu_prefill":true}
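The listing does not define "coprime loader". One plausible reading is a data loader that walks its samples or shards with a stride coprime to their count, which visits every index exactly once in a scattered order without building a shuffled index array. The sketch below illustrates that idea only. The function name and parameters are hypothetical, and the PR's loader may work differently.

```python
from math import gcd


def coprime_order(n, stride_hint, offset=0):
    """Yield each index in 0..n-1 exactly once, stepping by a stride coprime to n."""
    if n <= 0:
        return
    stride = stride_hint % n or 1
    # Bump the stride until it shares no factor with n; n - 1 always qualifies.
    while gcd(stride, n) != 1:
        stride += 1
    idx = offset % n
    for _ in range(n):
        yield idx
        idx = (idx + stride) % n


if __name__ == "__main__":
    # A hint of 4 shares a factor with 10, so the stride is bumped to 7.
    print(list(coprime_order(10, stride_hint=4)))
```

Since the order is a pure function of `n`, the stride, and the offset, it needs no setup pass, which is consistent with the "fast startup" goal the entry mentions.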

Novel Contributions

  • XSA-all architecture variant
  • Parallel Muon optimization
  • Coprime loader
  • Bigram2048 component
  • RoPE16 positional embedding
  • SWA
  • Late QAT with naive int6 quantization of the embeddings and 5 layers
  • zstd-compressed submission