PR #796

open

Record: 0.6567 BPB — Prefill Cache + 7-Gram Entropy-Adaptive + EBLS

by Robby955
val_bpb
0.6567
Architecture
EBLS Transformer
Optimizer
Artifact Size
~15.87 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
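A minimal sketch of the 6-bit storage format. This is plain symmetric per-output-channel round-to-nearest; GPTQ proper additionally propagates rounding error with second-order (Hessian) information, which is omitted here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-output-channel 6-bit quantization (round-to-nearest).
    GPTQ's Hessian-based error correction is intentionally omitted."""
    qmax = 2 ** (6 - 1) - 1                      # int6 symmetric range: [-31, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)     # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

Round-to-nearest bounds the reconstruction error of each weight by half a scale step per channel.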
Architecture
EBLS
Empirical Bayes Layer Sharing with 3 shared blocks × 3 loops plus 2 unique layers, for 11 layer applications in total.
parameters: {"layers":11,"shared_blocks":3,"loops":3}
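The parameter counts above imply 11 layer applications drawn from only 5 distinct parameter sets. A sketch of the schedule (placing the 2 unique layers after the loops is an assumption; the PR does not state where they sit):

```python
def ebls_schedule(shared_blocks=3, loops=3, unique_layers=2):
    """Order in which parameter sets are applied: the shared blocks repeat
    for `loops` passes, then the unique layers run once. Indices identify
    distinct parameter sets, so repeats mean weight sharing."""
    order = []
    for _ in range(loops):
        order.extend(range(shared_blocks))               # shared params, reused
    order.extend(range(shared_blocks, shared_blocks + unique_layers))
    return order

schedule = ebls_schedule()   # 11 applications from 5 distinct parameter sets
```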
XSA
XSA applied to all 11 layers.
parameters: {"layers":11}
BigramHash
Auxiliary hash-based bigram component.
parameters: {"vocab_size":3072,"dimension":128}
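A sketch of a hash-based bigram table: one embedding per position, keyed by a hash of the (previous, current) token pair. The bucket count, the multiplier constant, and the pad token are illustrative; the dimension of 128 comes from the PR.

```python
import numpy as np

BUCKETS, DIM = 1 << 16, 128    # bucket count here is illustrative; dim from the PR

rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)

def bigram_features(tokens):
    """Look up one row per position via a multiplicative hash of the
    (previous, current) token pair; position 0 pairs with pad token 0."""
    tokens = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate(([0], tokens[:-1]))
    idx = (prev * 1000003 + tokens) % BUCKETS
    return table[idx]

feats = bigram_features([5, 7, 5, 7])
```

Identical bigrams hash to the same bucket, so positions 1 and 3 above receive the same feature row.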
MLP3x
Three-layer MLP with LeakyReLU squared activation.
parameters: {"mlp_multiplier":3}
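A sketch of the block, assuming the hidden width is `mlp_multiplier * d_model` and the activation squares a LeakyReLU output; the exact squaring convention and the slope of 0.01 are assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def mlp3x(x, w1, w2, w3):
    """Three linear layers with hidden width 3 * d_model and a squared
    LeakyReLU between them (squaring form and slope are assumptions)."""
    h = leaky_relu(x @ w1) ** 2
    h = leaky_relu(h @ w2) ** 2
    return h @ w3

d = 64
rng = np.random.default_rng(0)
w1 = rng.standard_normal((d, 3 * d)) * 0.02
w2 = rng.standard_normal((3 * d, 3 * d)) * 0.02
w3 = rng.standard_normal((3 * d, d)) * 0.02
y = mlp3x(rng.standard_normal((2, d)), w1, w2, w3)
```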
KV head count
Grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4,"attention_heads":8}
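With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A sketch of the usual contiguous grouping (the grouping scheme itself is the standard GQA convention, assumed here):

```python
def kv_head_for(q_head, n_q=8, n_kv=4):
    """Which KV head a query head reads: heads 0-1 share KV head 0,
    heads 2-3 share KV head 1, and so on."""
    return q_head // (n_q // n_kv)

groups = [kv_head_for(h) for h in range(8)]
```

Sharing KV heads this way shrinks the KV cache by 2x relative to full multi-head attention.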
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"rope_dims":16,"total_dims":64}
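A sketch of partial RoPE on a 64-dim head: rotate only the first 16 dimensions, pass the remaining 48 through unchanged. The base frequency of 10000 is the common default, assumed here.

```python
import numpy as np

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rope_dims` of each head vector;
    the rest of the head dimension is left untouched.
    x: (seq, head_dim), pos: (seq,) absolute positions."""
    d = rope_dims // 2
    inv_freq = base ** (-np.arange(d) / d)
    ang = np.outer(pos, inv_freq)                       # (seq, d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :d], x[:, d:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)

x = np.random.default_rng(0).standard_normal((4, 64))
y = partial_rope(x, np.arange(4))
```

At position 0 the rotation is the identity, and the last 48 dimensions never change.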
Regularization
Layerwise LN scale
Per-layer normalization gain scaled by 1/sqrt(layer+1).
parameters: {"scale":"1/sqrt(layer+1)"}
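The scale rule above damps deeper layers. A sketch, assuming 0-indexed layers and that the factor multiplies each block's normalization output (the exact placement is not stated in the PR):

```python
import math

def ln_scale(layer_idx):
    """Per-layer scale 1/sqrt(layer + 1): layer 0 gets 1.0, layer 3 gets 0.5,
    and the factor shrinks monotonically with depth."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(11)]
```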
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
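A sketch of maintaining both averages with the stated parameters: an exponential moving average every step (decay 0.997) and a stochastic weight average that takes a running mean of snapshots every 50 steps. How the two are combined at evaluation time is not stated in the PR, so this only maintains them.

```python
def update_averages(step, params, ema, swa, swa_n,
                    ema_decay=0.997, swa_interval=50):
    """EMA update every step; SWA running mean of snapshots every
    `swa_interval` steps. Returns the updated SWA snapshot count."""
    for k, v in params.items():
        ema[k] = ema_decay * ema[k] + (1 - ema_decay) * v
    if step % swa_interval == 0:
        swa_n += 1
        for k, v in params.items():
            swa[k] += (v - swa[k]) / swa_n   # incremental mean
    return swa_n

params = {"w": 0.0}
ema, swa, swa_n = {"w": 0.0}, {"w": 0.0}, 0
for step in range(1, 101):
    params["w"] = float(step)
    swa_n = update_averages(step, params, ema, swa, swa_n)
```

After 100 steps, SWA has averaged the snapshots at steps 50 and 100.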
Compression
lzma
level: 9
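The quantized artifact is packed with stdlib LZMA at the maximum preset; a minimal round trip (the payload here is a stand-in for the serialized weights):

```python
import lzma

blob = bytes(range(256)) * 64            # stand-in for the serialized int6 weights
packed = lzma.compress(blob, preset=9)   # level 9 = maximum compression
restored = lzma.decompress(packed)
```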
Evaluation
sliding window eval
parameters: {"stride":64}
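Sliding-window evaluation advances the context window by the stride but scores only tokens not already scored by an earlier window, so every token is scored exactly once with long context. A sketch of the span bookkeeping; the window length of 512 is illustrative, the stride of 64 comes from the PR:

```python
def eval_spans(n_tokens, window=512, stride=64):
    """Return (ctx_start, ctx_end, score_start, score_end) tuples: each window
    sees up to `window` tokens of context, but only the not-yet-scored suffix
    is scored, so the scored spans tile [0, n_tokens) exactly once."""
    spans, scored_upto = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_upto, end))
        scored_upto = end
        if end == n_tokens:
            break
    return spans

spans = eval_spans(1000)
```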
7-gram causal cache with entropy-adaptive blending
parameters: {"orders":[2,3,4,5,6,7],"min_count":2,"buckets_per_table":4194304,"entropy_base":0.05,"entropy_range":0.55,"entropy_scale":2,"entropy_threshold":4}
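A sketch of the entropy-adaptive blend: lean on the n-gram cache more when the model is uncertain. The sigmoid shape of the weight function is an assumption; the four constants are the `entropy_base`/`entropy_range`/`entropy_scale`/`entropy_threshold` values from the parameter list, which keep the weight in (0.05, 0.60).

```python
import math

def ngram_weight(model_entropy, base=0.05, rng=0.55, scale=2.0, threshold=4.0):
    """Blend weight for the n-gram predictor as a function of the model's
    predictive entropy (sigmoid form is an assumption)."""
    return base + rng / (1.0 + math.exp(-scale * (model_entropy - threshold)))

def blend(p_model, p_ngram, model_entropy):
    """Convex mix of model and n-gram distributions; both inputs normalized."""
    w = ngram_weight(model_entropy)
    return [(1 - w) * pm + w * pn for pm, pn in zip(p_model, p_ngram)]

low, high = ngram_weight(1.0), ngram_weight(7.0)
```

A convex mix of two normalized distributions is itself normalized, so the blended prediction stays a valid distribution.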
Other
Distributed cache pre-fill for evaluation ranks using only preceding tokens to make multi-GPU evaluation identical to single-GPU sequential evaluation.
parameters: {"distributed_ranks":8}
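The cache pre-fill idea can be sketched with a toy counting cache: each rank scores a contiguous shard, but first replays every token that precedes its shard into the cache, so counts at every position match a single sequential pass. A unigram `Counter` stands in for the real 7-gram cache, and the contiguous sharding is an assumption.

```python
from collections import Counter

def eval_rank(tokens, rank, n_ranks=8):
    """Score one rank's shard after pre-filling the cache with all
    preceding tokens (the key to matching single-GPU evaluation)."""
    n = len(tokens)
    start, end = rank * n // n_ranks, (rank + 1) * n // n_ranks
    cache = Counter(tokens[:start])          # pre-fill: preceding tokens only
    scores = []
    for t in tokens[start:end]:
        scores.append(cache[t])              # stand-in for the n-gram lookup
        cache[t] += 1                        # causal update after scoring
    return scores
```

Concatenating the shard outputs over all 8 ranks reproduces the sequential single-GPU result exactly.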

Novel Contributions

  • Distributed cache pre-fill for multi-GPU evaluation
  • 7-gram causal cache with backoff cascade
  • Entropy-adaptive blending between model and n-gram predictions
  • EBLS architecture with shared blocks and loops
  • Val-GPTQ int6 quantization with LZMA compression
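The backoff cascade listed above can be sketched as follows: query the cache from order 7 down to order 2 and use the first context seen at least `min_count` times. The `tables` layout here is illustrative (a dict per order), not the PR's hashed 4M-bucket tables.

```python
def backoff_predict(tables, context, min_count=2, orders=(7, 6, 5, 4, 3, 2)):
    """Return the distribution from the highest-order table whose context has
    count >= min_count; fall back to shorter contexts otherwise.
    tables[k] maps a length-(k-1) context tuple to (count, distribution)."""
    for k in orders:
        entry = tables.get(k, {}).get(tuple(context[-(k - 1):]))
        if entry is not None and entry[0] >= min_count:
            return entry[1]
    return None   # no context qualified; caller falls back to the model alone

tables = {3: {(1, 2): (5, {3: 1.0})}, 2: {(2,): (9, {4: 1.0})}}
hit = backoff_predict(tables, [0, 1, 2])    # trigram context (1, 2) qualifies
miss = backoff_predict(tables, [9, 9])      # nothing seen often enough
```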