PR #1094 (open)

Record: 0.4027 BPB — Swarm-Designed Causal BackoffNgramMixer (3-seed mean, std 0.0015)

by michaelwinczuk
val_bpb: 0.4027
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.96-15.97 MB

Training Techniques

Architecture
  • LeakyReLU: squared LeakyReLU activation in the transformer MLP stack; parameters: {"power":2,"slope":0.75}
  • MTP heads: multi-token prediction heads used during training; parameters: {"heads":2}
  • BigramHash: adds a bigram hash component to the model stack; parameters: {"size":2048}
  • SmearGate: uses SmearGate in the architecture; parameters: null
  • XSA: applied to the last layers of the model; parameters: {"layers":4}
  • Partial RoPE: partial rotary positional embeddings; parameters: null
  • LN Scale: LayerNorm scale modification; parameters: null
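One plausible reading of the "LeakyReLU squared" entry above is a standard LeakyReLU followed by raising the result to the listed power. A minimal sketch under that assumption (the function name is mine; note that with an even power the negative branch folds back to positive values, and the PR does not state whether the sign is preserved):

```python
def leaky_relu_squared(x: float, slope: float = 0.75, power: int = 2) -> float:
    """Hypothetical "LeakyReLU squared" activation: LeakyReLU with the
    given negative slope, then raised to `power` (here 2, per the PR)."""
    y = x if x >= 0 else slope * x
    return y ** power
```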
Optimizer
  • Parallel Muon: weight_decay: null; momentum: 0.99; other_params: {"warmup_start_momentum":0.92}
Weight Averaging
  • EMA; parameters: {"decay":0.997}
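EMA weight averaging with decay 0.997 maintains a shadow copy of the parameters that is blended toward the live weights after each step. A generic sketch of one update (not the PR's code):

```python
def ema_update(shadow: list, params: list, decay: float = 0.997) -> list:
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    With decay 0.997 the shadow tracks roughly the last ~1/(1-decay) ≈ 333
    steps of training; the shadow weights are what get evaluated."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```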
Quantization
  • GPTQ-lite: bits: 6; scope: all
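The "GPTQ-lite" entry applies 6-bit quantization to all weights; GPTQ proper additionally corrects rounding error using second-order information, and the PR does not detail its "lite" variant. As a simpler illustration of what 6-bit weight quantization alone does, here is plain symmetric round-to-nearest (explicitly not GPTQ's algorithm):

```python
def quantize_dequantize(weights: list, bits: int = 6) -> list:
    """Symmetric round-to-nearest quantization to `bits` bits and back.
    Illustrative only: GPTQ-style methods further reduce the error that
    this naive rounding introduces."""
    qmax = 2 ** (bits - 1) - 1                      # 31 levels per sign at 6 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]
```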
Evaluation
  • sliding window eval; parameters: {"stride":64}
  • causal sequential chunk eval; parameters: {"score_first":true,"update_after":true}
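The causal sequential chunk evaluation ("score_first": true, "update_after": true) scores each chunk with the current n-gram counts before folding that chunk's tokens into the counts, so no token is ever scored by statistics that include it. A minimal sketch using a unigram count model with add-one smoothing (all names are mine; the PR's mixer uses higher orders):

```python
import math
from collections import Counter

def causal_chunk_eval(tokens: list, chunk_size: int, vocab_size: int) -> float:
    """Score each chunk with counts from strictly earlier chunks only,
    then update the counts: score first, update after."""
    counts, total, nll = Counter(), 0, 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        for t in chunk:  # score with the counts as they stood before this chunk
            p = (counts[t] + 1) / (total + vocab_size)
            nll -= math.log2(p)
        for t in chunk:  # only now fold the chunk into the counts
            counts[t] += 1
            total += 1
    return nll / len(tokens)  # bits per token
```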
Sequence Length
  • train_length: null; eval_length: null
Compression
  • lzma; level: null

Novel Contributions

  • Strictly causal sequential chunk evaluation that scores tokens before updating n-gram counts
  • Causal BackoffNgramMixer with orders 2-10 and 4M hash buckets
  • Entropy-adaptive alpha for mixing neural and n-gram probabilities
  • Swarm-designed training and architecture selection with transparent decision logging
  • Knowledge graph-conditioned embedding initialization
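The second and third contributions combine a backoff n-gram model with an entropy-adaptive weight for mixing its distribution with the neural one. A rough sketch of the mixing step only, assuming the n-gram weight grows with the entropy of the neural distribution (lean on n-grams when the network is uncertain); the function names, the linear schedule, and the 0.5 cap are my guesses, not the PR's:

```python
import math

def entropy_bits(p: list) -> float:
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0.0)

def mix(neural_p: list, ngram_p: list, max_alpha: float = 0.5) -> list:
    """Blend neural and n-gram distributions; alpha scales linearly with the
    neural distribution's entropy, from 0 (fully neural) to max_alpha."""
    h_max = math.log2(len(neural_p))
    alpha = max_alpha * (entropy_bits(neural_p) / h_max if h_max > 0 else 0.0)
    return [(1.0 - alpha) * n + alpha * g for n, g in zip(neural_p, ngram_p)]
```

With a confident (low-entropy) neural distribution the mix is nearly pure neural; at maximum entropy the n-gram model contributes its full capped share.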