PR #961

open

Record: 0.0881 BPB — 11L Int5 GPTQ + Order-12 N-gram + Phrase Cache + 65K Chunks

by callithyia
val_bpb
0.0881
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~13.0 MB

Training Techniques

Architecture
GQA
Grouped-query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
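A minimal NumPy sketch of the 8-query/4-KV grouped-query attention, where each KV head is shared by two query heads. Head dimension, weight shapes, and the causal mask are illustrative assumptions; batching and RoPE are omitted:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads (2:1)."""
    T, d = x.shape
    hd = wq.shape[1] // n_q_heads            # per-head dimension (assumed)
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads          # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                      # map query head -> shared KV head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -1e9), k=1)   # causal mask
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, -1)
```

Halving the KV heads shrinks the KV projection weights, which matters directly for the ~13 MB artifact budget.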
BigramHash
Bigram hash module used in the architecture.
parameters: {"dimensions":128}
XSA
XSA-4 architectural component.
parameters: {"variant":4}
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"numerator":16,"denominator":64}
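A sketch of partial RoPE with the stated 16/64 split: rotary embeddings are applied to the first 16 of 64 per-head dimensions and the rest pass through unrotated. The rotate-half layout and the 10000 base are common-convention assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16):
    """Rotate only the first `rot_dims` of each head's dimensions (16/64
    here); the remaining dimensions are left position-independent."""
    T, d = x.shape
    half = rot_dims // 2
    pos = np.arange(T)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    ang = pos * freqs                        # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```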
SmearGate
SmearGate component in the model architecture.
parameters: null
VE128
VE128 applied on layers 9-10.
parameters: {"layers":[9,10]}
LeakyReLU
LeakyReLU(0.5) squared activation in the MLP.
parameters: {"negative_slope":0.5,"squared":true}
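One plausible reading of "LeakyReLU(0.5) squared" as the MLP activation, sketched below. Note the square makes the output non-negative; a sign-preserving variant (y·|y|) is also conceivable, and the PR does not specify which:

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU with slope 0.5 on the negative side, then squared."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```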
weight tying
Input embedding and output head share one weight matrix.
parameters: null
Weight Averaging
Exponential moving average (EMA) of weights.
parameters: {"decay":0.997}
Stochastic weight averaging (SWA) over checkpoints.
parameters: {"checkpoints":15,"final_warmdown":true}
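Both averaging schemes, sketched over plain parameter dicts (the combination order of EMA and the 15-checkpoint SWA is not stated in the PR and is assumed here to be independent):

```python
def ema_update(ema_params, params, decay=0.997):
    """Per-step EMA of weights: ema <- decay*ema + (1-decay)*w."""
    return {k: decay * ema_params[k] + (1 - decay) * params[k]
            for k in params}

def swa_average(checkpoints):
    """Uniform average over the saved checkpoints (15 here), taken
    after the final LR warmdown."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n
            for k in checkpoints[0]}
```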
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"parameter_banking":true}
Quantization
GPTQ
bits: 5
scope: all
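For orientation only: the sketch below shows the 5-bit symmetric grid that GPTQ quantizes onto, using naive per-row round-to-nearest. GPTQ proper additionally compensates each column's rounding error using a Hessian estimate from calibration data, which this sketch deliberately omits:

```python
import numpy as np

def quantize_5bit_rtn(w):
    """Per-row round-to-nearest onto a symmetric int5 grid [-15, 15].
    Not GPTQ itself: no Hessian-based error propagation."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 15
    q = np.clip(np.round(w / scale), -15, 15)
    return q, scale          # dequantize with q * scale
```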
Compression
lzma
level: 9
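The artifact compression step is standard-library LZMA at the maximum preset:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """LZMA-compress the serialized model at preset 9 (slowest, smallest)."""
    return lzma.compress(raw, preset=9)
```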
Evaluation
order-12 n-gram cache
parameters: {"orders":[2,3,4,5,6,7,8,9,10,11,12],"backoff":true,"score_first":true,"backward_looking":true}
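A minimal sketch of the backoff part of the cache: count n-grams of orders 2..12 over tokens already seen in the eval stream, and at prediction time back off from the longest matching context. The `score_first` and `backward_looking` options, and how the cache distribution is blended with the model, are not modeled here:

```python
from collections import defaultdict

class BackoffNgramCache:
    """Orders 2..12 with longest-match backoff (naive counting sketch)."""
    def __init__(self, max_order=12):
        self.max_order = max_order
        self.counts = defaultdict(lambda: defaultdict(int))  # ctx -> next -> count

    def update(self, tokens):
        for n in range(2, self.max_order + 1):
            for i in range(len(tokens) - n + 1):
                ctx = tuple(tokens[i:i + n - 1])
                self.counts[ctx][tokens[i + n - 1]] += 1

    def predict(self, context):
        # Back off from the longest context that has been seen.
        for n in range(self.max_order, 1, -1):
            if len(context) < n - 1:
                continue
            ctx = tuple(context[-(n - 1):])
            if ctx in self.counts:
                nxt = self.counts[ctx]
                total = sum(nxt.values())
                return {t: c / total for t, c in nxt.items()}
        return {}
```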
long phrase cache
parameters: {"probe_lengths":[64,56,48,36,28,20,16]}
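The phrase cache probes progressively shorter suffixes of the current context (the listed probe lengths) against earlier text; on a repeat, it predicts the token that followed last time. The linear scan below is purely illustrative; a real implementation would index suffixes (e.g. by hash):

```python
def phrase_cache_predict(tokens, probe_lengths=(64, 56, 48, 36, 28, 20, 16)):
    """Return the continuation of the longest repeated suffix, or None."""
    for L in probe_lengths:
        if len(tokens) < L + 1:
            continue
        suffix = tokens[-L:]
        # search earlier occurrences, most recent first
        for start in range(len(tokens) - L - 1, -1, -1):
            if tokens[start:start + L] == suffix:
                return tokens[start + L]
    return None
```

Long exact repeats are common in evaluation text, so a confident hit here can assign near-certain probability to the next token.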
temperature sharpening
parameters: {"temperature":0.85}
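Temperature sharpening at T=0.85 rescales the predictive distribution toward its mode, which lowers BPB whenever the top prediction is usually correct:

```python
import numpy as np

def sharpen(probs, temperature=0.85):
    """Raise probabilities to 1/T (T<1 sharpens) and renormalize."""
    p = probs ** (1.0 / temperature)
    return p / p.sum()
```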
65K chunking
parameters: {"chunk_size":65000}
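Chunking is a straightforward split of the eval stream into 65,000-token pieces so each pass fits the 600 s wall-clock budget:

```python
def chunk_tokens(tokens, chunk_size=65000):
    """Split the token stream into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
```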
Regularization
logit softcap
parameters: {"value":30}
LR Schedule
warmdown
parameters: null
Sequence Length
sequence_length
train_length: null
eval_length: 65000

Novel Contributions

  • Order-12 backoff n-gram cache combined with a long phrase cache
  • Entropy-adaptive alpha for cache blending
  • Temperature sharpening at T=0.85
  • 65K-token chunking to keep evaluation under the 600s budget
  • Demonstration that cache-heavy evaluation can largely erase large pre-quantization model-quality gaps
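The entropy-adaptive alpha from the list above could look something like the sketch below: trust the cache more when its distribution is low-entropy (confident). The `max_alpha` value and the linear schedule are assumptions; the PR only states that the blend weight adapts to entropy:

```python
import numpy as np

def blend(model_probs, cache_probs, max_alpha=0.9):
    """Convex blend of model and cache distributions, with the cache
    weight scaled by the cache's (normalized) confidence."""
    eps = 1e-12
    p = np.clip(cache_probs, eps, 1.0)
    ent = -(p * np.log(p)).sum() / np.log(len(p))  # normalized entropy in [0, 1]
    alpha = max_alpha * (1.0 - ent)                # confident cache -> high alpha
    return alpha * cache_probs + (1 - alpha) * model_probs
```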