PR #1309
Non-record: Exact Sequence Matching on PR #1019 (1.1143 BPB)
by cadenmcmann
val_bpb
1.1143
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,842,788 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
Architecture
BigramHash
Base model uses a bigram hash component as part of the architecture.
parameters: {"size":3072,"dimension":112}
XSA
Attention uses XSA across all layers.
parameters: {"layers":11}
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Compression
lzma
level: 9
Evaluation
sliding window eval
parameters: {"stride":64}
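The sliding-window evaluation above can be sketched as follows. This is a minimal illustration of the general technique, not the PR's actual harness: `logprob_fn` is a hypothetical callback returning the model's natural-log probability of a target token given its context, and the window size of 1024 is an assumption; only the stride of 64 comes from the parameters above. Each window overlaps the previous one, and only tokens not already scored by an earlier window are counted, so every scored token is conditioned on up to `window - 1` preceding tokens.

```python
import math

def sliding_window_bpb(tokens, logprob_fn, window=1024, stride=64):
    """Sketch of sliding-window evaluation with stride 64.

    logprob_fn(context, target) is a hypothetical callback returning the
    model's natural-log probability of `target` given `context`. Returns
    bits per token; multiplying by tokens-per-byte would give BPB.
    """
    total_nll = 0.0
    n_scored = 0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        # Score only positions not already covered by an earlier window;
        # token 0 is skipped because it has no context.
        for t in range(max(prev_end, begin + 1), end):
            total_nll += -logprob_fn(tokens[begin:t], tokens[t])
            n_scored += 1
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / (n_scored * math.log(2))
```

With a uniform model over a 256-token vocabulary, every position costs exactly 8 bits, which gives a quick sanity check of the accounting.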
Other
Builds an exact-sequence-matching cache over 8-12 token n-grams during evaluation and blends cached next-token predictions into the model's output distribution.
parameters: {"min_order":8,"max_order":12,"lambda":0.15,"match_rate":0.0503,"match_accuracy":0.6557}
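A minimal sketch of how such an eval-time cache could work. The orders (8-12), the blend weight lambda = 0.15, and the reported match rate (5.03%) and match accuracy (65.57%) come from the parameters above; the class name, the dict-backed cache, the longest-suffix-first lookup, and the one-hot linear blend are all assumptions about the implementation, not the PR's actual code.

```python
import numpy as np

class NgramMatchCache:
    """Hypothetical eval-time exact-match cache over 8-12 token n-grams.

    For each context n-gram seen so far, record the token that followed
    it. At prediction time, look up the longest matching suffix of the
    current context (orders max_order down to min_order) and, on a hit,
    blend a one-hot cached prediction into the model's softmax output
    with weight lam.
    """

    def __init__(self, min_order=8, max_order=12, lam=0.15):
        self.min_order = min_order
        self.max_order = max_order
        self.lam = lam
        # context n-gram (tuple) -> next token observed after it
        self.next_token = {}

    def update(self, tokens):
        """Record next-token observations for every n-gram in `tokens`."""
        for n in range(self.min_order, self.max_order + 1):
            for i in range(len(tokens) - n):
                self.next_token[tuple(tokens[i:i + n])] = tokens[i + n]

    def blend(self, context, model_probs):
        """Blend the cached prediction into model_probs on a cache hit."""
        for n in range(self.max_order, self.min_order - 1, -1):
            if len(context) < n:
                continue
            hit = self.next_token.get(tuple(context[-n:]))
            if hit is not None:
                cached = np.zeros_like(model_probs)
                cached[hit] = 1.0
                return (1.0 - self.lam) * model_probs + self.lam * cached
        return model_probs  # no match: model output passes through unchanged
```

Because the blend is linear and lambda is small, a cache miss (the ~95% case at the reported match rate) leaves the model's distribution untouched, while a correct hit shifts 15% of the mass onto the cached token.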
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
Novel Contributions
- Eval-time exact sequence matching using 8-12 token n-grams
- Blending cached next-token predictions with model softmax during sliding window evaluation
- Improved sliding window BPB from 1.1152 to 1.1143 without retraining
- Applied the same eval-time technique on top of the current SOTA base model, PR #1019