PR #2088
Non-record: Causal Bigram Blending — eval-time BPB improvement (1×H20…
by MaxIv25
val_bpb: 1.0744
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16,148–16,151 KB
Training Techniques
Architecture
U-Net skip connections
11-layer recurrent transformer with U-Net skips and parallel residuals in later layers.
parameters: {"layers":11}
weight tying
Input embedding and output projection share one weight matrix.
parameters: null
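Weight tying itself is standard; a minimal illustration with hypothetical sizes:

```python
import torch.nn as nn

vocab_size, d_model = 50304, 768          # hypothetical sizes, not from the PR
wte = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = wte.weight               # weight tying: one shared parameter tensor
```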
Partial RoPE
Rotary position embeddings applied to a subset of dimensions.
parameters: {"dimensions":16}
SmearGate
SmearGate used in the model.
parameters: null
Sparse Attention Gate
Sparse attention gating used in the model.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend":"Polar Express Newton-Schulz","steps":5}
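Muon orthogonalizes each 2-D momentum update with a few Newton-Schulz-style iterations. The sketch below uses the quintic iteration and coefficients of the original Muon; the "Polar Express" variant tunes per-step coefficients, which are not reproduced here, so treat this as illustrative only.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2-D update G to the nearest semi-orthogonal matrix.

    Quintic iteration with the coefficients of the original Muon; the
    Polar Express schedule used in the PR applies different coefficients.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)            # Frobenius norm bounds the spectral norm at 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X
```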
Quantization
GPTQ
bits: 6
scope: attn+mlp
GPTQ
bits: 7
scope: embeddings
LQER
bits: 4
rank: 4
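LQER reconstructs the quantization error with a low-rank factor pair kept alongside the quantized weights. A sketch of that correction, assuming a generic quantizer (GPTQ in the PR) has already produced the dequantized matrix:

```python
import torch

def lqer_correction(W: torch.Tensor, W_q: torch.Tensor, rank: int = 4):
    """Return rank-`rank` factors (A, B) so that W ≈ W_q + A @ B.

    W:   original full-precision weight, shape (out, in)
    W_q: dequantized weight produced by the quantizer (e.g. GPTQ)
    """
    error = (W - W_q).float()
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (out, rank), singular values folded in
    B = Vh[:rank, :]                    # (rank, in)
    return A, B

# Usage (sketch): at load time, reconstruct W_hat = W_q + A @ B.
```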
Compression
brotli
level: null
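The compression level is not stated in the PR, so the quality below is simply the library maximum shown explicitly; the file name is hypothetical.

```python
import brotli

with open("checkpoint.bin", "rb") as f:        # hypothetical artifact path
    raw = f.read()

compressed = brotli.compress(raw, quality=11)  # quality 11 is brotli's maximum
with open("checkpoint.bin.br", "wb") as f:
    f.write(compressed)

print(f"{len(raw)} -> {len(compressed)} bytes")
```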
Weight Averaging
EMA
parameters: null
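A minimal EMA weight-averaging sketch; the decay value is a placeholder, not taken from the PR.

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.999):
    """Blend current weights into the EMA copy: ema = decay * ema + (1 - decay) * w."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Usage (sketch): ema_model = copy.deepcopy(model); call update_ema(...) after each
# optimizer step, then evaluate or export ema_model instead of model.
```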
Test-Time Training
score-first TTT
parameters: null
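"Score-first" presumably means each evaluation chunk is scored before the model takes a gradient step on it, so no token is ever scored by weights that have already trained on it. A sketch of that loop, with the optimizer, learning rate, and chunking all assumed:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr: float = 1e-5):
    """Evaluate a token stream chunk by chunk, scoring each chunk BEFORE updating on it."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # optimizer and lr are assumptions
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) Score with the current weights; no update has seen this chunk yet.
        with torch.no_grad():
            logits = model(inputs)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  targets.view(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
        # 2) Only then take a gradient step on the same chunk.
        optimizer.zero_grad()
        logits = model(inputs)
        F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1)).backward()
        optimizer.step()
    return total_nll / total_tokens   # mean NLL per token, in nats
```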
Evaluation
bigram blending
parameters: {"lambda":0.03,"adaptive_confidence":true,"laplace_smoothing":true}
Novel Contributions
- Causal Bigram Blending at evaluation time
- Adaptive blending of model log-probabilities with an online causal bigram prior
- Score-before-update compliant bigram counting: each token is scored before its bigram count is recorded (see the sketch below)
- Reported ~0.011 BPB improvement with no training cost or artifact size increase
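The contribution list is concrete enough to sketch: keep running bigram counts over the evaluation stream, score each token by mixing the model's probability with a Laplace-smoothed bigram probability conditioned on the previous token, and only afterwards add that token to the counts. Only lambda = 0.03, Laplace smoothing, and the score-before-update ordering come from the PR; the adaptive-confidence rule below (scaling lambda with how often the previous token has been seen) and the `model_probs` callback are assumptions.

```python
import math
from collections import defaultdict

def blended_nll(stream, model_probs, vocab_size: int, lam: float = 0.03,
                alpha: float = 1.0, k: float = 10.0) -> float:
    """Score a token stream with causal bigram blending.

    stream:      iterable of token ids
    model_probs: model_probs(prefix) -> probabilities over the vocab for the next
                 token (stand-in for the transformer's softmax output)
    lam:         maximum weight given to the bigram prior (0.03 per the PR)
    alpha:       Laplace smoothing pseudo-count (assumed value)
    k:           adaptive-confidence constant (assumed form, not from the PR)
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[prev][cur]
    context_totals = defaultdict(int)                # continuations observed after `prev`
    total_nll, prev, prefix = 0.0, None, []

    for tok in stream:
        p_model = model_probs(prefix)[tok]
        if prev is not None:
            total = context_totals[prev]
            p_bigram = (counts[prev][tok] + alpha) / (total + alpha * vocab_size)
            lam_eff = lam * total / (total + k)      # trust the prior more as counts grow
            p = (1.0 - lam_eff) * p_model + lam_eff * p_bigram
        else:
            p = p_model
        total_nll += -math.log(p)

        # Score-before-update: only now does this token enter the bigram counts.
        if prev is not None:
            counts[prev][tok] += 1
            context_totals[prev] += 1
        prev = tok
        prefix.append(tok)

    return total_nll / max(len(prefix), 1)   # mean NLL per token, in nats
```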