PR #727

open

Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed)

by Asukabot0
val_bpb
0.9674
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.99 MB

Training Techniques

Architecture
11L Transformer
11-layer Transformer with 512-dimensional hidden size, grouped-query attention (8 query heads, 4 KV heads), 3× MLP expansion, and several custom architectural components.
parameters: {"layers":11,"d_model":512,"gqa_heads":8,"kv_heads":4,"mlp_multiplier":3}
Squared LeakyReLU
Squared LeakyReLU activation with negative slope 0.5.
parameters: {"negative_slope":0.5}
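The activation can be sketched as follows. The PR does not spell out the exact form of "squared LeakyReLU"; this assumes an elementwise square applied after a LeakyReLU with negative slope 0.5:

```python
import numpy as np

def squared_leaky_relu(x, negative_slope=0.5):
    # LeakyReLU followed by an elementwise square (one plausible reading
    # of "Squared LeakyReLU"; the PR only states negative_slope=0.5).
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```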
XSA
Cross-sequence attention across all layers.
parameters: {"last_n":11}
Value Residual
Adds value residual connections.
parameters: null
Gated Attention
Uses gated attention mechanism.
parameters: null
SmearGate
Uses SmearGate embedding/attention component.
parameters: null
BigramHash
Bigram hashing feature with 4096 buckets.
parameters: {"buckets":4096}
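A minimal sketch of bucketed bigram hashing, assuming the previous and current token IDs are hashed jointly into one of 4096 buckets (the PR states only the bucket count, not the hash function):

```python
def bigram_bucket(prev_tok: int, tok: int, buckets: int = 4096) -> int:
    # Hypothetical hashing scheme: hash the (prev, current) token pair
    # and fold it into the stated 4096 buckets.
    return hash((prev_tok, tok)) % buckets
```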
Partial RoPE
Applies rotary positional embeddings to 16 of the 64 head dimensions, leaving the rest unrotated.
parameters: {"dimensions":"16/64"}
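A minimal sketch of partial RoPE under the stated 16/64 split: rotate only the first 16 head dimensions and pass the remaining 48 through unchanged (the choice of which dimensions to rotate, and the 10000 base frequency, are assumptions):

```python
import numpy as np

def partial_rope(x, pos, rope_dims=16):
    # x: (..., head_dim). Rotate only the first `rope_dims` dimensions
    # (16 of 64 per the PR's "16/64"); the rest are left untouched.
    half = rope_dims // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))  # assumed base
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[..., rope_dims:]], axis=-1)
```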
tied embeddings
Input and output embeddings are tied.
parameters: null
Regularization
LN Scale
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
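The EMA update with the stated decay of 0.997 can be sketched as:

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of weights: keep 99.7% of the running
    # average and mix in 0.3% of the current parameters each step.
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in params}
```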
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_start":0.92,"warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
Quantization
int6
bits: 6
scope: per-row weights
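A sketch of symmetric per-row int6 quantization, assuming a signed range of [-31, 31] and one scale per weight-matrix row (the PR states only "int6, per-row weights"):

```python
import numpy as np

def quantize_int6_per_row(w):
    # One scale per row, mapping each row's max magnitude to 31 (6-bit signed).
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```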
Compression
zstd
level: 16
Evaluation
n-gram eval cache
parameters: {"orders":"2-7","backoff":true,"entropy_adaptive_alpha":true}
Other
other
Entropy-adaptive interpolation between LM logits and n-gram statistics using model entropy.
parameters: {"alpha_formula":"0.05 + 0.55 * sigmoid(2 * (H - 4.0))"}
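The interpolation above can be sketched directly from the stated formula: the blend weight alpha grows with the model's own predictive entropy H, so the n-gram statistics matter most where the LM is uncertain. The probability-space mixing (rather than logit-space) is an assumption:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_adaptive_mix(lm_logits, ngram_probs):
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)), per the PR's formula,
    # with H the entropy (in bits) of the LM's own distribution.
    p = softmax(lm_logits)
    H = -(p * np.log2(np.clip(p, 1e-12, None))).sum(axis=-1, keepdims=True)
    alpha = 0.05 + 0.55 / (1.0 + np.exp(-2.0 * (H - 4.0)))
    return (1 - alpha) * p + alpha * ngram_probs
```

Note that alpha stays in [0.05, 0.60]: even a fully confident model keeps a 5% floor of n-gram mass, and a maximally uncertain one leans on the n-grams for at most 60%.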

Novel Contributions

  • Multi-order n-gram backoff over orders 2 through 7
  • Entropy-adaptive alpha for interpolating LM and n-gram statistics
  • First legal sub-1.0 BPB record claim
  • Score-first, backward-looking eval-time n-gram cache
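The backoff over orders 2-7 can be sketched as follows: try the longest matching context first and fall back to shorter orders, with a uniform fallback when nothing matches (the count-store layout and the fallback are assumptions; the PR states only the order range and that backoff is used):

```python
def ngram_predict(counts, context, vocab_size, max_order=7, min_order=2):
    # counts: {order: {context_tuple: {token: count}}}.
    # Back off from order 7 down to order 2, using the first order whose
    # (order-1)-token context suffix has been seen.
    for order in range(max_order, min_order - 1, -1):
        ctx = tuple(context[-(order - 1):])
        if ctx in counts.get(order, {}):
            dist = counts[order][ctx]
            total = sum(dist.values())
            return {tok: c / total for tok, c in dist.items()}
    # Assumed fallback: uniform over the vocabulary when no order matches.
    return {t: 1.0 / vocab_size for t in range(vocab_size)}
```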