PR #796

open

Record: 0.6567 BPB — Prefill Cache + 7-Gram Entropy-Adaptive + EBLS

by Robby955
val_bpb
0.6567
Architecture
EBLS Transformer
Optimizer
Artifact Size
~15.87 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all
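A minimal sketch of the 6-bit storage format. This is plain symmetric per-output-channel round-to-nearest; GPTQ proper additionally propagates rounding error with second-order (Hessian) information, which is omitted here.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-output-channel 6-bit quantization (round-to-nearest).
    GPTQ's Hessian-based error correction is intentionally omitted."""
    qmax = 2 ** (6 - 1) - 1                      # int6 symmetric range: [-31, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)     # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

Round-to-nearest bounds the reconstruction error of each weight by half a scale step per channel.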
Architecture
EBLS
Empirical Bayes Layer Sharing with 3 shared blocks × 3 loops plus 2 unique layers, for 11 layer applications in total.
parameters: {"layers":11,"shared_blocks":3,"loops":3}
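The parameter counts above imply 11 layer applications drawn from only 5 distinct parameter sets. A sketch of the schedule (placing the 2 unique layers after the loops is an assumption; the PR does not state where they sit):

```python
def ebls_schedule(shared_blocks=3, loops=3, unique_layers=2):
    """Order in which parameter sets are applied: the shared blocks repeat
    for `loops` passes, then the unique layers run once. Indices identify
    distinct parameter sets, so repeats mean weight sharing."""
    order = []
    for _ in range(loops):
        order.extend(range(shared_blocks))               # shared params, reused
    order.extend(range(shared_blocks, shared_blocks + unique_layers))
    return order

schedule = ebls_schedule()   # 11 applications from 5 distinct parameter sets
```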
XSA
XSA applied to all 11 layers.
parameters: {"layers":11}
BigramHash
Auxiliary hash-based bigram component.
parameters: {"vocab_size":3072,"dimension":128}
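A sketch of a hash-based bigram table: one embedding per position, keyed by a hash of the (previous, current) token pair. The bucket count, the multiplier constant, and the pad token are illustrative; the dimension of 128 comes from the PR.

```python
import numpy as np

BUCKETS, DIM = 1 << 16, 128    # bucket count here is illustrative; dim from the PR

rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)

def bigram_features(tokens):
    """Look up one row per position via a multiplicative hash of the
    (previous, current) token pair; position 0 pairs with pad token 0."""
    tokens = np.asarray(tokens, dtype=np.int64)
    prev = np.concatenate(([0], tokens[:-1]))
    idx = (prev * 1000003 + tokens) % BUCKETS
    return table[idx]

feats = bigram_features([5, 7, 5, 7])
```

Identical bigrams hash to the same bucket, so positions 1 and 3 above receive the same feature row.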
MLP3x
Three-layer MLP with LeakyReLU squared activation.
parameters: {"mlp_multiplier":3}
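A sketch of the block, assuming the hidden width is `mlp_multiplier * d_model` and the activation squares a LeakyReLU output; the exact squaring convention and the slope of 0.01 are assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def mlp3x(x, w1, w2, w3):
    """Three linear layers with hidden width 3 * d_model and a squared
    LeakyReLU between them (squaring form and slope are assumptions)."""
    h = leaky_relu(x @ w1) ** 2
    h = leaky_relu(h @ w2) ** 2
    return h @ w3

d = 64
rng = np.random.default_rng(0)
w1 = rng.standard_normal((d, 3 * d)) * 0.02
w2 = rng.standard_normal((3 * d, 3 * d)) * 0.02
w3 = rng.standard_normal((3 * d, d)) * 0.02
y = mlp3x(rng.standard_normal((2, d)), w1, w2, w3)
```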
KV head count
Grouped-query attention with 4 KV heads.
parameters: {"kv_heads":4,"attention_heads":8}
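With 8 query heads and 4 KV heads, each KV head serves a group of 2 query heads. A sketch of the usual contiguous grouping (the grouping scheme itself is the standard GQA convention, assumed here):

```python
def kv_head_for(q_head, n_q=8, n_kv=4):
    """Which KV head a query head reads: heads 0-1 share KV head 0,
    heads 2-3 share KV head 1, and so on."""
    return q_head // (n_q // n_kv)

groups = [kv_head_for(h) for h in range(8)]
```

Sharing KV heads this way shrinks the KV cache by 2x relative to full multi-head attention.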
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions.
parameters: {"rope_dims":16,"total_dims":64}
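A sketch of partial RoPE on a 64-dim head: rotate only the first 16 dimensions, pass the remaining 48 through unchanged. The base frequency of 10000 is the common default, assumed here.

```python
import numpy as np

def partial_rope(x, pos, rope_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rope_dims` of each head vector;
    the rest of the head dimension is left untouched.
    x: (seq, head_dim), pos: (seq,) absolute positions."""
    d = rope_dims // 2
    inv_freq = base ** (-np.arange(d) / d)
    ang = np.outer(pos, inv_freq)                       # (seq, d)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :d], x[:, d:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)

x = np.random.default_rng(0).standard_normal((4, 64))
y = partial_rope(x, np.arange(4))
```

At position 0 the rotation is the identity, and the last 48 dimensions never change.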
Regularization
Layerwise LN scale
Per-layer normalization gain scaled by 1/sqrt(layer+1).
parameters: {"scale":"1/sqrt(layer+1)"}
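The scale rule above damps deeper layers. A sketch, assuming 0-indexed layers and that the factor multiplies each block's normalization output (the exact placement is not stated in the PR):

```python
import math

def ln_scale(layer_idx):
    """Per-layer scale 1/sqrt(layer + 1): layer 0 gets 1.0, layer 3 gets 0.5,
    and the factor shrinks monotonically with depth."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_scale(i) for i in range(11)]
```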
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_interval":50}
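A sketch of maintaining both averages with the stated parameters: an exponential moving average every step (decay 0.997) and a stochastic weight average that takes a running mean of snapshots every 50 steps. How the two are combined at evaluation time is not stated in the PR, so this only maintains them.

```python
def update_averages(step, params, ema, swa, swa_n,
                    ema_decay=0.997, swa_interval=50):
    """EMA update every step; SWA running mean of snapshots every
    `swa_interval` steps. Returns the updated SWA snapshot count."""
    for k, v in params.items():
        ema[k] = ema_decay * ema[k] + (1 - ema_decay) * v
    if step % swa_interval == 0:
        swa_n += 1
        for k, v in params.items():
            swa[k] += (v - swa[k]) / swa_n   # incremental mean
    return swa_n

params = {"w": 0.0}
ema, swa, swa_n = {"w": 0.0}, {"w": 0.0}, 0
for step in range(1, 101):
    params["w"] = float(step)
    swa_n = update_averages(step, params, ema, swa, swa_n)
```

After 100 steps, SWA has averaged the snapshots at steps 50 and 100.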
Compression
lzma
level: 9
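The quantized artifact is packed with stdlib LZMA at the maximum preset; a minimal round trip (the payload here is a stand-in for the serialized weights):

```python
import lzma

blob = bytes(range(256)) * 64            # stand-in for the serialized int6 weights
packed = lzma.compress(blob, preset=9)   # level 9 = maximum compression
restored = lzma.decompress(packed)
```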
Evaluation
sliding window eval
parameters: {"stride":64}
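Sliding-window evaluation advances the context window by the stride but scores only tokens not already scored by an earlier window, so every token is scored exactly once with long context. A sketch of the span bookkeeping; the window length of 512 is illustrative, the stride of 64 comes from the PR:

```python
def eval_spans(n_tokens, window=512, stride=64):
    """Return (ctx_start, ctx_end, score_start, score_end) tuples: each window
    sees up to `window` tokens of context, but only the not-yet-scored suffix
    is scored, so the scored spans tile [0, n_tokens) exactly once."""
    spans, scored_upto = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_upto, end))
        scored_upto = end
        if end == n_tokens:
            break
    return spans

spans = eval_spans(1000)
```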
7-gram causal cache with entropy-adaptive blending
parameters: {"orders":[2,3,4,5,6,7],"min_count":2,"buckets_per_table":4194304,"entropy_base":0.05,"entropy_range":0.55,"entropy_scale":2,"entropy_threshold":4}
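A sketch of the entropy-adaptive blend: lean on the n-gram cache more when the model is uncertain. The sigmoid shape of the weight function is an assumption; the four constants are the `entropy_base`/`entropy_range`/`entropy_scale`/`entropy_threshold` values from the parameter list, which keep the weight in (0.05, 0.60).

```python
import math

def ngram_weight(model_entropy, base=0.05, rng=0.55, scale=2.0, threshold=4.0):
    """Blend weight for the n-gram predictor as a function of the model's
    predictive entropy (sigmoid form is an assumption)."""
    return base + rng / (1.0 + math.exp(-scale * (model_entropy - threshold)))

def blend(p_model, p_ngram, model_entropy):
    """Convex mix of model and n-gram distributions; both inputs normalized."""
    w = ngram_weight(model_entropy)
    return [(1 - w) * pm + w * pn for pm, pn in zip(p_model, p_ngram)]

low, high = ngram_weight(1.0), ngram_weight(7.0)
```

A convex mix of two normalized distributions is itself normalized, so the blended prediction stays a valid distribution.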
Other
Distributed cache pre-fill for evaluation ranks using only preceding tokens to make multi-GPU evaluation identical to single-GPU sequential evaluation.
parameters: {"distributed_ranks":8}
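The cache pre-fill idea can be sketched with a toy counting cache: each rank scores a contiguous shard, but first replays every token that precedes its shard into the cache, so counts at every position match a single sequential pass. A unigram `Counter` stands in for the real 7-gram cache, and the contiguous sharding is an assumption.

```python
from collections import Counter

def eval_rank(tokens, rank, n_ranks=8):
    """Score one rank's shard after pre-filling the cache with all
    preceding tokens (the key to matching single-GPU evaluation)."""
    n = len(tokens)
    start, end = rank * n // n_ranks, (rank + 1) * n // n_ranks
    cache = Counter(tokens[:start])          # pre-fill: preceding tokens only
    scores = []
    for t in tokens[start:end]:
        scores.append(cache[t])              # stand-in for the n-gram lookup
        cache[t] += 1                        # causal update after scoring
    return scores
```

Concatenating the shard outputs over all 8 ranks reproduces the sequential single-GPU result exactly.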

Novel Contributions

  • Distributed cache pre-fill for multi-GPU evaluation
  • 7-gram causal cache with backoff cascade
  • Entropy-adaptive blending between model and n-gram predictions
  • EBLS architecture with shared blocks and loops
  • Val-GPTQ int6 quantization with LZMA compression
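The backoff cascade listed above can be sketched as follows: query the cache from order 7 down to order 2 and use the first context seen at least `min_count` times. The `tables` layout here is illustrative (a dict per order), not the PR's hashed 4M-bucket tables.

```python
def backoff_predict(tables, context, min_count=2, orders=(7, 6, 5, 4, 3, 2)):
    """Return the distribution from the highest-order table whose context has
    count >= min_count; fall back to shorter contexts otherwise.
    tables[k] maps a length-(k-1) context tuple to (count, distribution)."""
    for k in orders:
        entry = tables.get(k, {}).get(tuple(context[-(k - 1):]))
        if entry is not None and entry[0] >= min_count:
            return entry[1]
    return None   # no context qualified; caller falls back to the model alone

tables = {3: {(1, 2): (5, {3: 1.0})}, 2: {(2,): (9, {4: 1.0})}}
hit = backoff_predict(tables, [0, 1, 2])    # trigram context (1, 2) qualifies
miss = backoff_predict(tables, [9, 9])      # nothing seen often enough
```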