PR #1094 (open)

Record: 0.4027 BPB — Swarm-Designed Causal BackoffNgramMixer (3-seed mean, std 0.0015)

by michaelwinczuk
val_bpb: 0.4027
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.96-15.97 MB

Training Techniques

Architecture
  • LeakyReLU: squared LeakyReLU activation in the transformer MLP stack; parameters: {"power":2,"slope":0.75}
  • MTP heads: multi-token prediction heads used during training; parameters: {"heads":2}
  • BigramHash: adds a bigram hash component to the model stack; parameters: {"size":2048}
  • SmearGate: uses SmearGate in the architecture; parameters: null
  • XSA: applied to the last layers of the model; parameters: {"layers":4}
  • Partial RoPE: partial rotary positional embeddings; parameters: null
  • LN Scale: LayerNorm scale modification; parameters: null
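One plausible reading of the "LeakyReLU squared" entry above is a standard LeakyReLU followed by raising the result to the listed power. A minimal sketch under that assumption (the function name is mine; note that with an even power the negative branch folds back to positive values, and the PR does not state whether the sign is preserved):

```python
def leaky_relu_squared(x: float, slope: float = 0.75, power: int = 2) -> float:
    """Hypothetical "LeakyReLU squared" activation: LeakyReLU with the
    given negative slope, then raised to `power` (here 2, per the PR)."""
    y = x if x >= 0 else slope * x
    return y ** power
```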
Optimizer
  • Parallel Muon: weight_decay: null; momentum: 0.99; other_params: {"warmup_start_momentum":0.92}
Weight Averaging
  • EMA; parameters: {"decay":0.997}
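EMA weight averaging with decay 0.997 maintains a shadow copy of the parameters that is blended toward the live weights after each step. A generic sketch of one update (not the PR's code):

```python
def ema_update(shadow: list, params: list, decay: float = 0.997) -> list:
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    With decay 0.997 the shadow tracks roughly the last ~1/(1-decay) ≈ 333
    steps of training; the shadow weights are what get evaluated."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```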
Quantization
  • GPTQ-lite: bits: 6; scope: all
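The "GPTQ-lite" entry applies 6-bit quantization to all weights; GPTQ proper additionally corrects rounding error using second-order information, and the PR does not detail its "lite" variant. As a simpler illustration of what 6-bit weight quantization alone does, here is plain symmetric round-to-nearest (explicitly not GPTQ's algorithm):

```python
def quantize_dequantize(weights: list, bits: int = 6) -> list:
    """Symmetric round-to-nearest quantization to `bits` bits and back.
    Illustrative only: GPTQ-style methods further reduce the error that
    this naive rounding introduces."""
    qmax = 2 ** (bits - 1) - 1                      # 31 levels per sign at 6 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]
```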
Evaluation
  • sliding window eval; parameters: {"stride":64}
  • causal sequential chunk eval; parameters: {"score_first":true,"update_after":true}
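The causal sequential chunk evaluation ("score_first": true, "update_after": true) scores each chunk with the current n-gram counts before folding that chunk's tokens into the counts, so no token is ever scored by statistics that include it. A minimal sketch using a unigram count model with add-one smoothing (all names are mine; the PR's mixer uses higher orders):

```python
import math
from collections import Counter

def causal_chunk_eval(tokens: list, chunk_size: int, vocab_size: int) -> float:
    """Score each chunk with counts from strictly earlier chunks only,
    then update the counts: score first, update after."""
    counts, total, nll = Counter(), 0, 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        for t in chunk:  # score with the counts as they stood before this chunk
            p = (counts[t] + 1) / (total + vocab_size)
            nll -= math.log2(p)
        for t in chunk:  # only now fold the chunk into the counts
            counts[t] += 1
            total += 1
    return nll / len(tokens)  # bits per token
```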
Sequence Length
  • train_length: null; eval_length: null
Compression
  • lzma; level: null

Novel Contributions

  • Strictly causal sequential chunk evaluation that scores tokens before updating n-gram counts
  • Causal BackoffNgramMixer with orders 2-10 and 4M hash buckets
  • Entropy-adaptive alpha for mixing neural and n-gram probabilities
  • Swarm-designed training and architecture selection with transparent decision logging
  • Knowledge graph-conditioned embedding initialization
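The second and third contributions combine a backoff n-gram model with an entropy-adaptive weight for mixing its distribution with the neural one. A rough sketch of the mixing step only, assuming the n-gram weight grows with the entropy of the neural distribution (lean on n-grams when the network is uncertain); the function names, the linear schedule, and the 0.5 cap are my guesses, not the PR's:

```python
import math

def entropy_bits(p: list) -> float:
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0.0)

def mix(neural_p: list, ngram_p: list, max_alpha: float = 0.5) -> list:
    """Blend neural and n-gram distributions; alpha scales linearly with the
    neural distribution's entropy, from 0 (fully neural) to max_alpha."""
    h_max = math.log2(len(neural_p))
    alpha = max_alpha * (entropy_bits(neural_p) / h_max if h_max > 0 else 0.0)
    return [(1.0 - alpha) * n + alpha * g for n, g in zip(neural_p, ngram_p)]
```

With a confident (low-entropy) neural distribution the mix is nearly pure neural; at maximum entropy the n-gram model contributes its full capped share.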