PR #786 (open)
0.8128 BPB: Classical Compression Eval + N-gram Backoff on PR #549 Base
by shinegami-2002
val_bpb: 0.8128
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.88 MB
Training Techniques
Architecture
MLP3x
Three-layer MLP with squared LeakyReLU activation (negative slope 0.5).
parameters: {"layers":3}
BigramHash
Uses a bigram hash component in the base model.
parameters: {"size":1536}
XSA
Applies XSA to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Uses rotary positional embeddings on a subset of dimensions.
parameters: {"dimensions":16}
weight tying
The input embedding and output projection share weights (weight tying).
parameters: null
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
Quantization
GPTQ-lite
bits: 6
scope: model
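The card names GPTQ-lite at 6 bits but gives no algorithm details. As a stand-in baseline only, here is plain symmetric round-to-nearest 6-bit quantization; real GPTQ additionally compensates quantization error column-by-column using second-order statistics.

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Symmetric round-to-nearest quantization to `bits` bits, per tensor.
    NOT the GPTQ-lite algorithm itself; a baseline illustration."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```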
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"adam_weight_decay":0.04}
Adam
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.025}
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
Multi-order n-gram backoff (orders 2-7) with entropy-adaptive alpha mixing during evaluation, inspired by classical compression methods.
parameters: {"orders":[2,3,4,5,6,7],"alpha_formula":"0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
other
Vectorized numpy-based eval-time augmentation with flat hash tables and scatter-add updates.
parameters: {"hash_tables_per_order":2,"buckets_per_order":4000000}
Compression
lzma
level: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Novel Contributions
- Eval-time augmentation using multi-order n-gram backoff (orders 2-7)
- Entropy-adaptive alpha mixing between neural and n-gram predictions
- Vectorized numpy implementation for compressed evaluation
- Classical compression-inspired approach based on cmix/PAQ ideas
- Evaluation-time updates that use only backward-looking context, at zero artifact-size cost