PR #786 (open)
0.8128 BPB: Classical Compression Eval + N-gram Backoff on PR #549 Base
by shinegami-2002
val_bpb: 0.8128
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.88 MB
Training Techniques
Architecture
MLP3x
Three-layer MLP with squared LeakyReLU activation (negative slope 0.5).
parameters: {"layers":3}
BigramHash
Uses a bigram hash component in the base model.
parameters: {"size":1536}
XSA
Applies XSA to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Uses rotary positional embeddings on a subset of dimensions.
parameters: {"dimensions":16}
weight tying
The input embedding and output projection share weights (weight tying).
parameters: null
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
Quantization
GPTQ-lite
bits: 6
scope: model
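The card names GPTQ-lite at 6 bits but gives no algorithm details. As a stand-in baseline only, here is plain symmetric round-to-nearest 6-bit quantization; real GPTQ additionally compensates quantization error column-by-column using second-order statistics.

```python
import numpy as np

def quantize_rtn(w, bits=6):
    """Symmetric round-to-nearest quantization to `bits` bits, per tensor.
    NOT the GPTQ-lite algorithm itself; a baseline illustration."""
    qmax = 2 ** (bits - 1) - 1                 # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```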
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"adam_weight_decay":0.04}
Adam
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.025}
Evaluation
sliding window eval
parameters: {"stride":64}
Other
other
Multi-order n-gram backoff (orders 2-7) with entropy-adaptive alpha mixing during evaluation, inspired by classical compression methods.
parameters: {"orders":[2,3,4,5,6,7],"alpha_formula":"0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
other
Vectorized numpy-based eval-time augmentation with flat hash tables and scatter-add updates.
parameters: {"hash_tables_per_order":2,"buckets_per_order":4000000}
Compression
lzma
level: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Novel Contributions
- Eval-time augmentation using multi-order n-gram backoff (orders 2-7)
- Entropy-adaptive alpha mixing between neural and n-gram predictions
- Vectorized numpy implementation for compressed evaluation
- Classical compression-inspired approach based on cmix/PAQ ideas
- Evaluation-time updates that use only backward-looking context, at zero artifact-size cost