PR #1094 (open)
Record: 0.4027 BPB — Swarm-Designed Causal BackoffNgramMixer (3-seed mean, std 0.0015)
by michaelwinczuk
val_bpb: 0.4027
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.96-15.97 MB
Training Techniques
Architecture
LeakyReLU
Uses a squared LeakyReLU activation in the transformer MLP stack.
parameters: {"power":2,"slope":0.75}
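Read literally, the parameters suggest a LeakyReLU with negative slope 0.75 raised to the power 2. A minimal elementwise sketch; the function name and the exact composition are assumptions, not taken from the PR:

```python
def leaky_relu_squared(x: float, slope: float = 0.75, power: int = 2) -> float:
    """Hypothetical squared LeakyReLU, applied elementwise in the MLP.

    Assumes {"power": 2, "slope": 0.75} map to the exponent and the
    negative-side slope; note that squaring makes the output non-negative.
    """
    y = x if x >= 0.0 else slope * x
    return y ** power
```

In a real model this would be a vectorized activation module between the MLP's two linear layers.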
MTP heads
Multi-token prediction heads used during training.
parameters: {"heads":2}
BigramHash
Adds a bigram hash component to the model stack.
parameters: {"size":2048}
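One plausible reading of a 2048-bucket bigram hash component: hash each (previous token, current token) pair into a fixed-size table whose buckets index an auxiliary embedding or count table. A hypothetical sketch; the hash constants are illustrative, not from the submission:

```python
def bigram_bucket(prev_token: int, token: int, size: int = 2048) -> int:
    """Map a (prev, current) token pair to one of `size` hash buckets.

    The multiplier is an arbitrary large prime; the submission's actual
    hashing scheme is not specified in this card.
    """
    return ((prev_token * 1_000_003) ^ token) % size
```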
SmearGate
Uses SmearGate in the architecture.
parameters: null
XSA
Applies XSA to the last four layers of the model.
parameters: {"layers":4}
Partial RoPE
Uses partial rotary positional embeddings.
parameters: null
LN Scale
LayerNorm scale modification.
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start_momentum":0.92}
Weight Averaging
EMA
parameters: {"decay":0.997}
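EMA with decay 0.997 keeps a shadow copy of the weights updated as `ema = 0.997 * ema + 0.003 * current` each step. A framework-free sketch over flat float lists:

```python
def ema_update(ema: list, current: list, decay: float = 0.997) -> None:
    """In-place exponential moving average of weights (decay as listed).

    Real use would loop over framework tensors; evaluation then runs
    with the EMA weights instead of the raw ones.
    """
    for i, (e, c) in enumerate(zip(ema, current)):
        ema[i] = decay * e + (1.0 - decay) * c
```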
Quantization
GPTQ-lite
bits: 6
scope: all
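The card only names "GPTQ-lite" at 6 bits over all weights; GPTQ proper uses second-order calibration, which is beyond a card-sized sketch, so the following is only a generic symmetric round-to-nearest 6-bit quantize/dequantize baseline for scale intuition, not the submission's method:

```python
def quant_dequant(values: list, bits: int = 6) -> list:
    """Symmetric uniform quantization to `bits` bits and back.

    A stand-in baseline, not the submission's GPTQ-lite: one scale per
    tensor, round-to-nearest, signed range [-(2^(bits-1)-1), 2^(bits-1)-1].
    """
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
```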
Evaluation
sliding window eval
parameters: {"stride":64}
causal sequential chunk eval
parameters: {"score_first":true,"update_after":true}
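The two parameters read as: score each chunk's tokens with the counts accumulated so far first, then fold the chunk into the counts, so no token is ever scored by statistics that already contain it. A hedged sketch of the loop; the scorer/updater interface is hypothetical:

```python
def causal_chunk_eval(chunks, score_fn, update_fn) -> float:
    """Strictly causal sequential chunk evaluation.

    Each chunk is scored before its tokens update the n-gram counts
    (score_first=True, update_after=True), preserving causality.
    Returns the mean per-token loss reported by score_fn.
    """
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        total_loss += score_fn(chunk)   # score with counts seen so far
        total_tokens += len(chunk)
        update_fn(chunk)                # only now absorb the chunk
    return total_loss / max(total_tokens, 1)
```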
Sequence Length
sequence_length
train_length: null
eval_length: null
Compression
lzma
level: null
Novel Contributions
- Strictly causal sequential chunk evaluation that scores tokens before updating n-gram counts
- Causal BackoffNgramMixer with orders 2-10 and 4M hash buckets
- Entropy-adaptive alpha for mixing neural and n-gram probabilities
- Swarm-designed training and architecture selection with transparent decision logging
- Knowledge graph-conditioned embedding initialization
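Combining the first three bullets, a hedged end-to-end sketch: a hashed backoff n-gram model over orders 2-10 with 4M buckets, updated only after scoring, whose estimate is mixed with the neural probability by an alpha that grows with the neural distribution's normalized entropy. The hashing, backoff rule, and alpha schedule are all assumptions; only the orders and bucket count come from the contribution list:

```python
from collections import defaultdict

class BackoffNgramMixer:
    """Sketch of a causal backoff n-gram model (orders 2-10, hashed contexts)."""

    def __init__(self, orders=range(2, 11), buckets=4_000_000):
        self.orders = list(orders)
        self.buckets = buckets
        # per-order: context bucket -> next-token counts
        self.counts = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}

    def _bucket(self, context: tuple) -> int:
        return hash(context) % self.buckets

    def update(self, tokens: list) -> None:
        """Fold tokens into the counts (called only AFTER scoring them)."""
        for n in self.orders:
            for i in range(n - 1, len(tokens)):
                ctx = tuple(tokens[i - n + 1:i])
                self.counts[n][self._bucket(ctx)][tokens[i]] += 1

    def prob(self, context: list, token):
        """Back off from the longest matching order with any counts."""
        for n in sorted(self.orders, reverse=True):
            if len(context) < n - 1:
                continue
            hist = self.counts[n][self._bucket(tuple(context[-(n - 1):]))]
            total = sum(hist.values())
            if total:
                return hist[token] / total
        return None  # no n-gram evidence at any order

def entropy_adaptive_mix(p_neural, p_ngram, entropy, max_entropy):
    """Assumed schedule: alpha = normalized entropy of the neural
    distribution, so the n-gram estimate gets more weight when the
    neural model is uncertain. Direction and form are guesses."""
    if p_ngram is None:
        return p_neural
    alpha = min(max(entropy / max_entropy, 0.0), 1.0)
    return (1.0 - alpha) * p_neural + alpha * p_ngram
```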