PR #831

open

Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization

by sseanliu
val_bpb
1.1284
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
16MB

Training Techniques

Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"batched_banks":true}
Architecture
XSA
Cross-window/self-attention variant used in the SOTA stack; a variant named XSA-all is also referenced among the failed techniques.
parameters: {"last_n":4}
EMA
Exponential moving average used as part of the base recipe.
parameters: null
SmearGate
Custom gating/architecture component in the base recipe.
parameters: null
BigramHash
Hash-based architectural component used in the base recipe.
parameters: {"vocab_size":2048}
Quantization
int6
bits: 6
scope: per-row weights
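The int6 per-row scheme above can be sketched as follows; symmetric scaling and the rounding details are assumptions, since the PR only lists the bit width and scope:

```python
def quantize_int6_row(row):
    """Symmetric per-row quantization to signed 6-bit ints.

    One float scale per row; quantized values lie in [-31, 31].
    """
    qmax = 2 ** 5 - 1                                   # 31 for signed int6
    scale = (max(abs(x) for x in row) / qmax) or 1.0    # guard all-zero rows
    q = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int6 values and the row scale."""
    return [v * scale for v in q]
```

Scaling per row rather than per tensor bounds each row's quantization error by half its own scale, which matters when row norms vary widely.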
Weight Averaging
EMA
parameters: {"decay":0.997}
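For reference, the EMA update with the listed decay of 0.997 is the standard exponential moving average; the function name is mine:

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]
```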
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmdown_iters":3000}
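A minimal sketch of the warmup + warmdown shape implied by the parameters above; the flat middle phase and the `total_steps` argument are assumptions:

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """Multiplier on the base LR: linear warmup, flat middle, linear warmdown."""
    if step < warmup_steps:
        return step / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0
```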
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Sequence Length
sequence_length
train_length: 1024
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
long context eval
parameters: {"cache_tokens":8192,"effective_context":50000}
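The sliding-window evaluation with stride 64 and a 2048-token context can be sketched as below; `nll_fn` (per-token negative log-likelihood under the model) is a stand-in, not an interface from the PR:

```python
def sliding_window_nll(tokens, nll_fn, context_length=2048, stride=64):
    """Average per-token NLL, scoring `stride` new tokens per forward pass,
    each with up to `context_length` tokens of left context."""
    total, count = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + stride, len(tokens))
        window = tokens[max(0, end - context_length):end]
        n_new = end - begin                  # only newly covered tokens count
        total += sum(nll_fn(window)[-n_new:])
        count += n_new
    return total / count
```

A smaller stride gives each scored token more left context at the cost of proportionally more forward passes.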
Test-Time Training
score-first TTT
parameters: null
Other
other
Throughput-quantization co-optimization analysis showing that small per-step overheads can negate BPB gains under the 16MB/600s constraint.
parameters: {"throughput_tax_bpb_per_ms":0.007}
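The throughput tax reduces to a break-even rule: under the fixed 600 s budget, each millisecond of per-step overhead must buy at least 0.007 BPB of improvement. A sketch, with the linear break-even model as an assumption:

```python
def required_bpb_gain(overhead_ms, tax_bpb_per_ms=0.007):
    """Minimum BPB improvement needed to pay for per-step overhead
    under the 16MB/600s constraint."""
    return overhead_ms * tax_bpb_per_ms

def is_worth_it(measured_bpb_gain, overhead_ms, tax_bpb_per_ms=0.007):
    """True if a technique's measured BPB gain exceeds its throughput tax."""
    return measured_bpb_gain > required_bpb_gain(overhead_ms, tax_bpb_per_ms)
```

For example, a technique adding 2 ms per step only pays off if it improves val BPB by more than 0.014.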

Novel Contributions

  • Systematic evaluation of six March 2026 architectural innovations on the PR #549 SOTA stack
  • Claim that throughput-quantization co-optimization is the binding constraint at 16MB/600s
  • Throughput tax formula estimating BPB gain required per millisecond of overhead
  • Observation that MLP shape affects quantizability
  • Observation that hypersphere normalization is incompatible with per-row quantization
  • Proposal of Neural Cache: caching K/V pairs across sliding windows to extend effective context during evaluation without changing model weights
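The Neural Cache proposal above could look roughly like this; the bounded FIFO eviction policy and the class interface are assumptions, since the PR only names the idea:

```python
from collections import deque

class CrossWindowKVCache:
    """Bounded FIFO store of per-token (key, value) pairs carried across
    sliding windows, so attention in the next window can see prior context
    without any change to model weights."""

    def __init__(self, cache_tokens=8192):
        self.cache_tokens = cache_tokens
        self.kv = deque()

    def extend(self, keys, values):
        """Cache K/V for the window just processed, evicting oldest first."""
        self.kv.extend(zip(keys, values))
        while len(self.kv) > self.cache_tokens:
            self.kv.popleft()

    def context(self):
        """K/V pairs visible to the next window's attention, oldest first."""
        return list(self.kv)
```

With `cache_tokens=8192` and 2048-token windows, successive windows accumulate several windows' worth of prior K/V, which is how the effective context can exceed the window length.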