PR #2088
Non-record: Causal Bigram Blending — eval-time BPB improvement (1×H20…
by MaxIv25
val_bpb: 1.0744
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16,148–16,151 KB
Training Techniques
Architecture
U-Net skip connections
11-layer recurrent transformer with U-Net skips and parallel residuals in later layers.
parameters: {"layers":11}
weight tying
Input embedding and output projection share one weight matrix.
parameters: null
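Weight tying itself is standard; a minimal illustration with hypothetical sizes:

```python
import torch.nn as nn

vocab_size, d_model = 50304, 768          # hypothetical sizes, not from the PR
wte = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = wte.weight               # weight tying: one shared parameter tensor
```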
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
SmearGate
SmearGate used in the model.
parameters: null
Sparse Attention Gate
Sparse attention gating used in the model.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend":"Polar Express Newton-Schulz","steps":5}
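Muon orthogonalizes each 2-D momentum update with a few Newton-Schulz-style iterations. The sketch below uses the quintic iteration and coefficients of the original Muon; the "Polar Express" variant tunes per-step coefficients, which are not reproduced here, so treat this as illustrative only.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2-D update G to the nearest semi-orthogonal matrix.

    Quintic iteration with the coefficients of the original Muon; the
    Polar Express schedule used in the PR applies different coefficients.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)            # Frobenius norm bounds the spectral norm at 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X
```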
Quantization
GPTQ
bits: 6
scope: attn+mlp
GPTQ
bits: 7
scope: embeddings
LQER
bits: 4
rank: 4
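LQER reconstructs the quantization error with a low-rank factor pair kept alongside the quantized weights. A sketch of that correction, assuming a generic quantizer (GPTQ in the PR) has already produced the dequantized matrix:

```python
import torch

def lqer_correction(W: torch.Tensor, W_q: torch.Tensor, rank: int = 4):
    """Return rank-`rank` factors (A, B) so that W ≈ W_q + A @ B.

    W:   original full-precision weight, shape (out, in)
    W_q: dequantized weight produced by the quantizer (e.g. GPTQ)
    """
    error = (W - W_q).float()
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (out, rank), singular values folded in
    B = Vh[:rank, :]                    # (rank, in)
    return A, B

# Usage (sketch): at load time, reconstruct W_hat = W_q + A @ B.
```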
Compression
brotli
level: null
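The compression level is not stated in the PR, so the quality below is simply the library maximum shown explicitly; the file name is hypothetical.

```python
import brotli

with open("checkpoint.bin", "rb") as f:        # hypothetical artifact path
    raw = f.read()

compressed = brotli.compress(raw, quality=11)  # quality 11 is brotli's maximum
with open("checkpoint.bin.br", "wb") as f:
    f.write(compressed)

print(f"{len(raw)} -> {len(compressed)} bytes")
```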
Weight Averaging
EMA
parameters: null
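A minimal EMA weight-averaging sketch; the decay value is a placeholder, not taken from the PR.

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.999):
    """Blend current weights into the EMA copy: ema = decay * ema + (1 - decay) * w."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Usage (sketch): ema_model = copy.deepcopy(model); call update_ema(...) after each
# optimizer step, then evaluate or export ema_model instead of model.
```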
Test-Time Training
score-first TTT
parameters: null
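"Score-first" presumably means each evaluation chunk is scored before the model takes a gradient step on it, so no token is ever scored by weights that have already trained on it. A sketch of that loop, with the optimizer, learning rate, and chunking all assumed:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr: float = 1e-5):
    """Evaluate a token stream chunk by chunk, scoring each chunk BEFORE updating on it."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # optimizer and lr are assumptions
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) Score with the current weights; no update has seen this chunk yet.
        with torch.no_grad():
            logits = model(inputs)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  targets.view(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
        # 2) Only then take a gradient step on the same chunk.
        optimizer.zero_grad()
        logits = model(inputs)
        F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1)).backward()
        optimizer.step()
    return total_nll / total_tokens   # mean NLL per token, in nats
```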
Evaluation
bigram blending
parameters: {"lambda":0.03,"adaptive_confidence":true,"laplace_smoothing":true}
Novel Contributions
- Causal Bigram Blending at evaluation time
- Adaptive blending of model log-probabilities with an online causal bigram prior
- Score-before-update compliant bigram counting: each token is scored before its bigram count is recorded (see the sketch below)
- Reported ~0.011 BPB improvement with no training cost or artifact size increase
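The contribution list is concrete enough to sketch: keep running bigram counts over the evaluation stream, score each token by mixing the model's probability with a Laplace-smoothed bigram probability conditioned on the previous token, and only afterwards add that token to the counts. Only lambda = 0.03, Laplace smoothing, and the score-before-update ordering come from the PR; the adaptive-confidence rule below (scaling lambda with how often the previous token has been seen) and the `model_probs` callback are assumptions.

```python
import math
from collections import defaultdict

def blended_nll(stream, model_probs, vocab_size: int, lam: float = 0.03,
                alpha: float = 1.0, k: float = 10.0) -> float:
    """Score a token stream with causal bigram blending.

    stream:      iterable of token ids
    model_probs: model_probs(prefix) -> probabilities over the vocab for the next
                 token (stand-in for the transformer's softmax output)
    lam:         maximum weight given to the bigram prior (0.03 per the PR)
    alpha:       Laplace smoothing pseudo-count (assumed value)
    k:           adaptive-confidence constant (assumed form, not from the PR)
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[prev][cur]
    context_totals = defaultdict(int)                # continuations observed after `prev`
    total_nll, prev, prefix = 0.0, None, []

    for tok in stream:
        p_model = model_probs(prefix)[tok]
        if prev is not None:
            total = context_totals[prev]
            p_bigram = (counts[prev][tok] + alpha) / (total + alpha * vocab_size)
            lam_eff = lam * total / (total + k)      # trust the prior more as counts grow
            p = (1.0 - lam_eff) * p_model + lam_eff * p_bigram
        else:
            p = p_model
        total_nll += -math.log(p)

        # Score-before-update: only now does this token enter the bigram counts.
        if prev is not None:
            counts[prev][tok] += 1
            context_totals[prev] += 1
        prev = tok
        prefix.append(tok)

    return total_nll / max(len(prefix), 1)   # mean NLL per token, in nats
```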