PR #831

open

Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization

by sseanliu
val_bpb
1.1284
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
16MB

Training Techniques

Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"batched_banks":true}
Architecture
XSA
Cross-window/self-attention variant used in the SOTA stack; a variant named XSA-all is also referenced among the failed techniques.
parameters: {"last_n":4}
EMA
Exponential moving average used as part of the base recipe.
parameters: null
SmearGate
Custom gating/architecture component in the base recipe.
parameters: null
BigramHash
Hash-based architectural component used in the base recipe.
parameters: {"vocab_size":2048}
Quantization
int6
bits: 6
scope: per-row weights
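The int6 per-row scheme above can be sketched as follows; symmetric scaling and the rounding details are assumptions, since the PR only lists the bit width and scope:

```python
def quantize_int6_row(row):
    """Symmetric per-row quantization to signed 6-bit ints.

    One float scale per row; quantized values lie in [-31, 31].
    """
    qmax = 2 ** 5 - 1                                   # 31 for signed int6
    scale = (max(abs(x) for x in row) / qmax) or 1.0    # guard all-zero rows
    q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int6 values and the row scale."""
    return [v * scale for v in q]
```

Scaling per row rather than per tensor bounds each row's quantization error by half its own scale, which matters when row norms vary widely.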
Weight Averaging
EMA
parameters: {"decay":0.997}
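For reference, the EMA update with the listed decay of 0.997 is the standard exponential moving average; the function name is mine:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]
```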
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmdown_iters":3000}
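A minimal sketch of the warmup + warmdown shape implied by the parameters above; the flat middle phase and the `total_steps` argument are assumptions:

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """Multiplier on the base LR: linear warmup, flat middle, linear warmdown."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```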
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
long context eval
parameters: {"cache_tokens":8192,"effective_context":50000}
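The sliding-window evaluation with stride 64 and a 2048-token context can be sketched as below; `nll_fn` (per-token negative log-likelihood under the model) is a stand-in, not an interface from the PR:

```python
def sliding_window_nll(tokens, nll_fn, context_length=2048, stride=64):
    """Average per-token NLL, scoring `stride` new tokens per forward pass,
    each with up to `context_length` tokens of left context."""
    total, count = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + stride, len(tokens))
        window = tokens[max(0, end - context_length):end]
        n_new = end - begin                  # only newly covered tokens count
        total += sum(nll_fn(window)[-n_new:])
        count += n_new
    return total / count
```

A smaller stride gives each scored token more left context at the cost of proportionally more forward passes.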
Test-Time Training
score-first TTT
parameters: null
Other
other
Throughput-quantization co-optimization analysis showing that small per-step overheads can negate BPB gains under the 16MB/600s constraint.
parameters: {"throughput_tax_bpb_per_ms":0.007}
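The throughput tax reduces to a break-even rule: under the fixed 600 s budget, each millisecond of per-step overhead must buy at least 0.007 BPB of improvement. A sketch, with the linear break-even model as an assumption:

```python
def required_bpb_gain(overhead_ms, tax_bpb_per_ms=0.007):
    """Minimum BPB improvement needed to pay for per-step overhead
    under the 16MB/600s constraint."""
    return overhead_ms * tax_bpb_per_ms

def is_worth_it(measured_bpb_gain, overhead_ms, tax_bpb_per_ms=0.007):
    """True if a technique's measured BPB gain exceeds its throughput tax."""
    return measured_bpb_gain > required_bpb_gain(overhead_ms, tax_bpb_per_ms)
```

For example, a technique adding 2 ms per step only pays off if it improves val BPB by more than 0.014.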

Novel Contributions

  • Systematic evaluation of six March 2026 architectural innovations on the PR #549 SOTA stack
  • Claim that throughput-quantization co-optimization is the binding constraint at 16MB/600s
  • Throughput tax formula estimating BPB gain required per millisecond of overhead
  • Observation that MLP shape affects quantizability
  • Observation that hypersphere normalization is incompatible with per-row quantization
  • Proposal of Neural Cache: caching K/V pairs across sliding windows to extend effective context during evaluation without changing model weights
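The Neural Cache proposal above could look roughly like this; the bounded FIFO eviction policy and the class interface are assumptions, since the PR only names the idea:

```python
from collections import deque

class CrossWindowKVCache:
    """Bounded FIFO store of per-token (key, value) pairs carried across
    sliding windows, so attention in the next window can see prior context
    without any change to model weights."""

    def __init__(self, cache_tokens=8192):
        self.cache_tokens = cache_tokens
        self.kv = deque()

    def extend(self, keys, values):
        """Cache K/V for the window just processed, evicting oldest first."""
        self.kv.extend(zip(keys, values))
        while len(self.kv) > self.cache_tokens:
            self.kv.popleft()

    def context(self):
        """K/V pairs visible to the next window's attention, oldest first."""
        return list(self.kv)
```

With `cache_tokens=8192` and 2048-token windows, successive windows accumulate several windows' worth of prior K/V, which is how the effective context can exceed the window length.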