PR #702 (open)
Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (100% autonomous research via goldfish)
by lukacf
val_bpb: 1.0244
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.79 MB
Training Techniques
Architecture
- XSA: XSA-all attention variant used in the 11-layer transformer. parameters: {"layers":11}
- SmearGate: SmearGate component included in the base architecture.
- BigramHash: BigramHash feature used in the base architecture and referenced in the prior baseline. parameters: {"size":2048}
- RoPE: partial RoPE applied to the model. parameters: {"dimensions":"16/64"}
Quantization
- int6 QAT (bits: 6, scope: all)
Compression
- zstd (level: 22)
Weight Averaging
- EMA (decay: 0.997)
Evaluation
- sliding window eval (stride: 64)
Test-Time Training
- full TTT (epochs: 100, learning_rate: 0.001)
LR Schedule
- cosine decay (scheduler: CosineAnnealingLR, t_max: 100, eta_min: 0.00001)
Regularization
- weight decay
Other
- Entropy-adaptive n-gram mixing during evaluation: reliance on the n-gram predictions increases when model entropy is high. parameters: {"alpha_formula":"0.05 + 0.35 * sigmoid(2 * (H - 4.0))"}
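The alpha formula above can be sketched directly (a minimal illustration; the function and argument names are mine, not from the submission):

```python
import math

def entropy_adaptive_alpha(H, floor=0.05, span=0.35, slope=2.0, pivot=4.0):
    """Mixing weight for the n-gram distribution: 0.05 + 0.35 * sigmoid(2 * (H - 4.0)).

    H is the model's predictive entropy at the current position. Low entropy
    keeps alpha near the 0.05 floor; high entropy pushes it toward 0.40.
    """
    return floor + span * (1.0 / (1.0 + math.exp(-slope * (H - pivot))))
```

At the pivot entropy H = 4.0 the sigmoid is 0.5, giving alpha = 0.225; the schedule is bounded in (0.05, 0.40), so the model's own distribution always keeps the majority of the weight.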
- Multi-order n-gram backoff cache used at evaluation time, backing off from 5-gram to 4-gram, 3-gram, and 2-gram contexts. parameters: {"orders":[2,3,4,5]}
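A minimal sketch of such a backoff cache, assuming simple count-based statistics (the class and method names are hypothetical, not from the submission):

```python
from collections import defaultdict

class NGramBackoffCache:
    """Count tables for orders 2..5; predict() backs off from the longest matching context."""

    def __init__(self, orders=(2, 3, 4, 5)):
        self.orders = sorted(orders, reverse=True)  # try 5-gram first, then 4, 3, 2
        # counts[n][context_tuple][next_token] -> occurrence count
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}

    def update(self, tokens):
        """Record tokens[-1] as the continuation of its preceding context at every order."""
        for n in self.orders:
            if len(tokens) >= n:
                ctx = tuple(tokens[-n:-1])
                self.counts[n][ctx][tokens[-1]] += 1

    def predict(self, context):
        """Return a token->probability dict from the longest order whose context was seen, else None."""
        for n in self.orders:
            ctx = tuple(context[-(n - 1):])
            if len(ctx) == n - 1 and ctx in self.counts[n]:
                bucket = self.counts[n][ctx]
                total = sum(bucket.values())
                return {tok: c / total for tok, c in bucket.items()}
        return None  # no context match at any order
```

The backoff is strict longest-match: a 5-gram hit shadows the lower orders, and only when no order matches does the caller fall back to the model distribution alone.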
Novel Contributions
- Multi-order n-gram backoff across 2-gram to 5-gram contexts
- Entropy-adaptive mixing weight based on model entropy
- Score-first eval-time n-gram cache updated only after scoring each token
- Proper distribution-preserving mixture of model and n-gram probabilities
- Autonomous research workflow using Goldfish ML and Meerkat
- 3-seed validation of the submission
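Putting the contributions together, the score-first mixture loop might look like the sketch below. This is an illustration under stated assumptions, not the submission's code: `model_probs` (a callable returning a token-probability dict) and the cache's `predict`/`update` interface are assumed, and the mixture falls back to the model alone when no n-gram context matches so the result stays a valid distribution.

```python
import math

def score_first_eval(tokens, model_probs, cache, alpha_fn):
    """Average bits-per-token; the cache is updated only AFTER each token is scored."""
    total_bits, scored = 0.0, 0
    for t in range(1, len(tokens)):
        p_model = model_probs(tokens[:t])  # assumed interface: dict token -> prob
        # model entropy in bits drives the mixing weight
        H = -sum(p * math.log2(p) for p in p_model.values() if p > 0)
        alpha = alpha_fn(H)
        p_ngram = cache.predict(tokens[:t])
        if p_ngram is None:
            p = p_model.get(tokens[t], 0.0)  # no n-gram match: model distribution alone
        else:
            # weights sum to 1, so the mixture is itself a probability distribution
            p = (1 - alpha) * p_model.get(tokens[t], 0.0) + alpha * p_ngram.get(tokens[t], 0.0)
        total_bits += -math.log2(max(p, 1e-12))
        scored += 1
        cache.update(tokens[:t + 1])  # score-first: the cache never sees the token early
    return total_bits / max(scored, 1)
```

Updating the cache only after scoring is what keeps the eval-time n-gram statistics leak-free: the token being scored never contributes to its own prediction.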