PR #1309
Non-record: Exact Sequence Matching on PR #1019 (1.1143 BPB)
by cadenmcmann
val_bpb
1.1143
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15,842,788 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
Architecture
BigramHash
Base model uses a bigram hash component as part of the architecture.
parameters: {"size":3072,"dimension":112}
XSA
Attention uses XSA across all layers.
parameters: {"layers":11}
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Compression
lzma
level: 9
Evaluation
sliding window eval
parameters: {"stride":64}
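The sliding-window evaluation above can be sketched as follows. This is a minimal illustration of the general technique, not the PR's actual harness: `logprob_fn` is a hypothetical callback returning the model's natural-log probability of a target token given its context, and the window size of 1024 is an assumption; only the stride of 64 comes from the parameters above. Each window overlaps the previous one, and only tokens not already scored by an earlier window are counted, so every scored token is conditioned on up to `window - 1` preceding tokens.

```python
import math

def sliding_window_bpb(tokens, logprob_fn, window=1024, stride=64):
    """Sketch of sliding-window evaluation with stride 64.

    logprob_fn(context, target) is a hypothetical callback returning the
    model's natural-log probability of `target` given `context`. Returns
    bits per token; multiplying by tokens-per-byte would give BPB.
    """
    total_nll = 0.0
    n_scored = 0
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        # Score only positions not already covered by an earlier window;
        # token 0 is skipped because it has no context.
        for t in range(max(prev_end, begin + 1), end):
            total_nll += -logprob_fn(tokens[begin:t], tokens[t])
            n_scored += 1
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / (n_scored * math.log(2))
```

With a uniform model over a 256-token vocabulary, every position costs exactly 8 bits, which gives a quick sanity check of the accounting.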
Other
Builds an exact-sequence-matching cache over 8-12 token n-grams during evaluation and blends cached next-token predictions into the model's output distribution.
parameters: {"min_order":8,"max_order":12,"lambda":0.15,"match_rate":0.0503,"match_accuracy":0.6557}
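A minimal sketch of how such an eval-time cache could work. The orders (8-12), the blend weight lambda = 0.15, and the reported match rate (5.03%) and match accuracy (65.57%) come from the parameters above; the class name, the dict-backed cache, the longest-suffix-first lookup, and the one-hot linear blend are all assumptions about the implementation, not the PR's actual code.

```python
import numpy as np

class NgramMatchCache:
    """Hypothetical eval-time exact-match cache over 8-12 token n-grams.

    For each context n-gram seen so far, record the token that followed
    it. At prediction time, look up the longest matching suffix of the
    current context (orders max_order down to min_order) and, on a hit,
    blend a one-hot cached prediction into the model's softmax output
    with weight lam.
    """

    def __init__(self, min_order=8, max_order=12, lam=0.15):
        self.min_order = min_order
        self.max_order = max_order
        self.lam = lam
        # context n-gram (tuple) -> next token observed after it
        self.next_token = {}

    def update(self, tokens):
        """Record next-token observations for every n-gram in `tokens`."""
        for n in range(self.min_order, self.max_order + 1):
            for i in range(len(tokens) - n):
                self.next_token[tuple(tokens[i:i + n])] = tokens[i + n]

    def blend(self, context, model_probs):
        """Blend the cached prediction into model_probs on a cache hit."""
        for n in range(self.max_order, self.min_order - 1, -1):
            if len(context) < n:
                continue
            hit = self.next_token.get(tuple(context[-n:]))
            if hit is not None:
                cached = np.zeros_like(model_probs)
                cached[hit] = 1.0
                return (1.0 - self.lam) * model_probs + self.lam * cached
        return model_probs  # no match: model output passes through unchanged
```

Because the blend is linear and lambda is small, a cache miss (the ~95% case at the reported match rate) leaves the model's distribution untouched, while a correct hit shifts 15% of the mass onto the cached token.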
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
Novel Contributions
- Eval-time exact sequence matching using 8-12 token n-grams
- Blending cached next-token predictions with model softmax during sliding window evaluation
- Improved sliding window BPB from 1.1152 to 1.1143 without retraining
- Applied the same eval-time technique on top of the current SOTA base model, PR #1019