val_bpb: 0.9623
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.87 MB
Training Techniques
Evaluation
- sliding window eval; parameters: {"stride": 64}
- entropy-adaptive cache blending; parameters: {"alpha_formula": "0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
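A minimal sketch of the stride-64 sliding-window evaluation. Only the stride comes from the card; the 512-token window and the `nll_fn` model interface are assumptions.

```python
def sliding_window_mean_nll(tokens, nll_fn, window=512, stride=64):
    """Mean per-token NLL over a long sequence with a stride-64 sliding window.

    Each step scores the next `stride` tokens using up to `window` tokens of
    left context, so every token is evaluated exactly once. `nll_fn(ctx, n)`
    is a placeholder for the model call: it returns the per-token NLLs of the
    last `n` tokens of `ctx`.
    """
    total, count, pos = 0.0, 0, 0
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        ctx = tokens[max(0, end - window):end]   # at most `window` tokens of context
        n_targets = end - pos                    # tokens scored this step
        total += sum(nll_fn(ctx, n_targets))
        count += n_targets
        pos = end
    return total / count                         # mean NLL per token
```

For a byte-level model, val_bpb is this mean NLL divided by ln 2 when `nll_fn` returns nats.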
Other
- 7-gram entropy-adaptive causal cache: a PPM-style n-gram backoff model blended with the neural model during evaluation; parameters: {"orders": "2-7", "buckets_per_table": 4194304, "min_count": 2, "alpha_base": 0.05, "alpha_range": 0.55, "alpha_scale": 2, "alpha_threshold": 4}
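The cache and its entropy-adaptive blend can be sketched end to end. The orders, min_count, and alpha constants come from the card; the linear-mixture blend form, the entropy unit, and the use of Python dicts in place of flat 4194304-bucket arrays are assumptions.

```python
import math

def cache_alpha(H, base=0.05, rng=0.55, scale=2.0, threshold=4.0):
    """alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)): when the model's
    predictive entropy H is low (confident), alpha stays near 0.05; when H
    is high, alpha rises toward 0.60 and the n-gram cache dominates."""
    return base + rng / (1.0 + math.exp(-scale * (H - threshold)))

class NGramCache:
    """Causal hashed n-gram cache, orders 2..7, with highest-order backoff.

    Counts live in fixed-size hash tables (4194304 buckets per table in the
    card; dicts stand in for flat arrays here). A context is usable only if
    its bucket holds at least `min_count` observations; otherwise we back
    off to the next-lower order. Updates are strictly backward-looking: a
    token is counted only after it has been scored.
    """
    def __init__(self, orders=range(2, 8), buckets=4194304, min_count=2):
        self.orders = list(orders)
        self.buckets = buckets
        self.min_count = min_count
        self.tables = {n: {} for n in self.orders}

    def _bucket(self, ctx):
        return hash(ctx) % self.buckets

    def predict(self, history):
        """Token->prob dict from the highest usable order, or None."""
        for n in sorted(self.orders, reverse=True):
            if len(history) < n - 1:
                continue
            counts = self.tables[n].get(self._bucket(tuple(history[-(n - 1):])))
            if counts and sum(counts.values()) >= self.min_count:
                total = sum(counts.values())
                return {t: c / total for t, c in counts.items()}
        return None

    def update(self, history, token):
        """Record `token` for every order's context (called after scoring)."""
        for n in self.orders:
            if len(history) >= n - 1:
                b = self.tables[n].setdefault(
                    self._bucket(tuple(history[-(n - 1):])), {})
                b[token] = b.get(token, 0) + 1

def blend(p_model, p_cache, H):
    """Mix the two distributions with the entropy-adaptive weight."""
    if p_cache is None:
        return p_model
    a = cache_alpha(H)
    return {t: (1 - a) * p_model.get(t, 0.0) + a * p_cache.get(t, 0.0)
            for t in set(p_model) | set(p_cache)}
```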
Architecture
- XSA: XSA-all applied across all 11 layers; parameters: {"layers": 11}
- EBLS: Empirical Bayes Layer Sharing with shared blocks and loops; parameters: {"shared_blocks": 3, "loops": 3}
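The card does not define the Empirical Bayes component of EBLS, but the structural part, 3 shared blocks reapplied over 3 loops, is plain weight tying:

```python
def apply_shared_blocks(x, blocks, loops=3):
    """Layer sharing by looping: the same 3 blocks are reapplied 3 times,
    giving loops * len(blocks) = 9 effective layers while storing weights
    for only len(blocks) = 3. How the Empirical Bayes prior ties or adapts
    the shared weights is not specified in the card; this shows only the
    weight-tied control flow."""
    for _ in range(loops):
        for block in blocks:
            x = block(x)
    return x
```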
- LoRA: low-rank adaptation; parameters: {"rank": 8}
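A rank-8 LoRA update in the standard form; the alpha scaling constant is an assumption, since the card only gives the rank:

```python
def matvec(W, x):
    """Dense matrix-vector product over nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def lora_forward(W0, A, B, x, rank=8, alpha=8.0):
    """y = W0 @ x + (alpha / rank) * B @ (A @ x).

    W0 is the frozen base weight; A (rank x d_in) and B (d_out x rank) form
    the trainable rank-8 update. The alpha/rank scaling is the common LoRA
    convention, not something the card states.
    """
    h = matvec(A, x)             # project into the rank-8 bottleneck
    delta = matvec(B, h)         # project back to d_out
    scale = alpha / rank
    return [y + scale * d for y, d in zip(matvec(W0, x), delta)]
```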
- BigramHash: hashed bigram embedding component; parameters: {"vocab": 3072, "dim": 128}
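One plausible reading of the BigramHash component, treating vocab=3072 as the bucket count of the hash table; the pair-mixing function below is illustrative, not the submission's actual hash:

```python
def bigram_bucket(prev_tok, cur_tok, n_buckets=3072):
    """Hash a (prev, cur) token pair into one of n_buckets embedding rows.
    Any stable pair hash works; this multiplier is arbitrary."""
    return (prev_tok * 1000003 + cur_tok) % n_buckets

def bigram_hash_embedding(tokens, table):
    """Per-position bigram features (dim 128 in the card): position i gets
    the table row for the (tokens[i-1], tokens[i]) pair; position 0 has no
    bigram and gets a zero vector. Typically added to the token embedding."""
    dim = len(table[0])
    out = [[0.0] * dim]
    for i in range(1, len(tokens)):
        out.append(table[bigram_bucket(tokens[i - 1], tokens[i], len(table))])
    return out
```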
- RoPE: partial rotary positional embeddings (16 of 64 head dims rotated); parameters: {"dims": "16/64"}
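Partial RoPE in the "16/64" configuration: only the first 16 of 64 per-head dimensions are rotated. The interleaved pair layout and the 10000 frequency base are assumptions:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` entries of a 64-dim head vector;
    the remaining 48 pass through unchanged. Pairs (vec[2i], vec[2i+1])
    are rotated by angle pos * base**(-2i / rot_dims)."""
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos * base ** (-2.0 * i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```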
- KV head count: grouped-query attention with reduced KV heads; parameters: {"heads": 8, "kv_heads": 4}
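The query-to-KV head mapping implied by heads=8, kv_heads=4:

```python
def kv_head_for_query(q_head, n_heads=8, n_kv_heads=4):
    """Grouped-query attention head mapping: 8 query heads share 4 KV heads,
    so each K/V projection serves a group of 8 // 4 = 2 consecutive query
    heads, halving KV-cache size versus full multi-head attention."""
    return q_head // (n_heads // n_kv_heads)
```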
- MLP3x: three-layer MLP with squared LeakyReLU activation; parameters: {"activation": "LeakyReLU(0.5)^2"}
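The activation read literally as act(x) = LeakyReLU_0.5(x)^2; a sign-preserving variant (y * |y|) is also possible but not stated in the card:

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with slope 0.5, then squared. Note the square makes the
    output non-negative; the literal square is used here since the card
    does not say whether the sign is reintroduced."""
    y = x if x >= 0 else slope * x
    return y * y

def mlp3(x, weights):
    """Three linear layers with the squared activation between them (no
    activation after the last layer; that placement is an assumption)."""
    for i, W in enumerate(weights):
        x = [sum(w * v for w, v in zip(row, x)) for row in W]
        if i < len(weights) - 1:
            x = [leaky_relu_sq(v) for v in x]
    return x
```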
Weight Averaging
- EMA + SWA; parameters: {"ema_decay": 0.997, "swa_interval": 50}
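The two averaging schemes, sketched independently; the card gives the decay and the snapshot interval but not how the two averages are combined into the final checkpoint:

```python
def ema_update(ema, weights, decay=0.997):
    """EMA of parameters: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

class SWA:
    """Equal-weight stochastic weight averaging, snapshotting every
    `interval` optimizer steps (swa_interval=50 in the card)."""
    def __init__(self, interval=50):
        self.interval, self.step, self.n, self.mean = interval, 0, 0, None

    def observe(self, weights):
        self.step += 1
        if self.step % self.interval:
            return
        self.n += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Running equal-weight mean over the snapshots seen so far.
            self.mean = [m + (w - m) / self.n for m, w in zip(self.mean, weights)]
```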
Quantization
- GPTQ; bits: 6, scope: all
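Only the int6 grid of the quantizer is sketched here; full GPTQ additionally compensates rounding error column by column using second-order statistics from the calibration data (the validation set, per the card's "val-calibrated" note):

```python
def quantize_int6(row):
    """Symmetric round-to-nearest onto a 6-bit grid for one weight row:
    integers in [-31, 31] with a per-row scale. This is the grid GPTQ
    quantizes onto, without GPTQ's Hessian-aware error propagation."""
    qmax = 31  # 6-bit symmetric: 2**5 - 1
    scale = max(abs(v) for v in row) / qmax or 1.0
    return [round(v / scale) for v in row], scale

def dequantize(q, scale):
    return [scale * v for v in q]
```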
Compression
- lzma; level: 9
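The final packaging step with Python's standard `lzma` module at preset 9; the serialization format of the quantized weight payload itself is not specified in the card:

```python
import lzma

def pack_artifact(payload: bytes) -> bytes:
    """Compress the serialized (GPTQ int6-quantized) weights with LZMA at
    preset 9, matching the card's level: 9."""
    return lzma.compress(payload, preset=9)

def unpack_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```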
Novel Contributions
- 7-gram causal entropy-adaptive n-gram cache blended with neural predictions
- Strictly backward-looking cache updates with no oracle/min(NLL) selection
- Entropy-based adaptive alpha that increases cache weight when model entropy is high
- EBLS layer sharing with 3 shared blocks and 3 loops
- XSA-all across all 11 layers
- Val-calibrated GPTQ int6 quantization combined with LZMA compression