PR #774

Status: open

Record: Order-Adaptive Entropy Gating + XSA-All (val_bpb=0.9370)

by travispchen
val_bpb: 0.9370
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.9 MB

Training Techniques

Architecture
XSA
Cross-Self-Attention extended from the last 4 layers to all 11 transformer layers.
parameters: {"layers":11}
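For reference, the layer-coverage change expressed as a small config sketch; the XSA block itself is not reproduced here and the `use_xsa` flag name is hypothetical.

```python
n_layers = 11

# Previous records: XSA only on the last 4 transformer layers.
xsa_layers_prev = set(range(n_layers - 4, n_layers))        # {7, 8, 9, 10}

# This record: XSA enabled on every layer.
xsa_layers_all = set(range(n_layers))                       # {0, ..., 10}

# Hypothetical per-layer flag passed to each block's constructor.
layer_flags = [{"use_xsa": i in xsa_layers_all} for i in range(n_layers)]
```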
Partial RoPE
Partial rotary positional embeddings: RoPE is applied to only a 16-dimensional subset of each head, with the remaining dimensions left unrotated.
parameters: {"dimensions":16}
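A generic partial-RoPE sketch, assuming the 16 rotary dimensions are the leading dimensions of each head and the rest are passed through unchanged (the exact split and base frequency are assumptions).

```python
import torch

def partial_rope(x: torch.Tensor, rope_dims: int = 16, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to the first `rope_dims` dims of each head; pass the rest through.
    x: (batch, n_heads, seq_len, head_dim). A generic sketch, not the repo's exact kernel."""
    x_rot, x_pass = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    seq = x.shape[-2]
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32, device=x.device) / half))
    t = torch.arange(seq, dtype=torch.float32, device=x.device)
    freqs = torch.outer(t, inv_freq)                         # (seq, half)
    cos, sin = freqs.cos(), freqs.sin()
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```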
MLP3x
Transformer MLP widened to 3x hidden size with LeakyReLU^2 activation.
parameters: {"hidden_size_multiplier":3}
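A sketch of the widened MLP with a squared LeakyReLU activation; the negative slope (PyTorch's default 0.01) and the bias-free projections are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP3x(nn.Module):
    """Feed-forward block widened to 3x the model dimension with a LeakyReLU^2 activation."""
    def __init__(self, d_model: int, hidden_mult: int = 3, negative_slope: float = 0.01):
        super().__init__()
        self.up = nn.Linear(d_model, hidden_mult * d_model, bias=False)
        self.down = nn.Linear(hidden_mult * d_model, d_model, bias=False)
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.leaky_relu(self.up(x), self.negative_slope)
        return self.down(h * h)        # square the activation: LeakyReLU^2
```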
BigramHash
Bigram hash embedding: consecutive token pairs are hashed into a fixed number of buckets (1536) that index an auxiliary embedding table.
parameters: {"buckets":1536}
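A sketch of a hashed bigram embedding with 1536 buckets; the embedding dimension, the hash constant, and how the output joins the token embeddings are not specified in the record and are placeholders.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash each (previous token, current token) pair into one of `buckets` slots
    and look up an auxiliary embedding for that slot."""
    def __init__(self, buckets: int, dim: int):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq). The first position is paired with itself.
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = tokens[..., 0]
        idx = (prev * 1000003 + tokens) % self.buckets   # simple multiplicative hash (illustrative)
        return self.table(idx)                           # (batch, seq, dim)
```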
Value Embedding
Value Embedding applied on later layers.
parameters: {"dimension":128,"layers":[9,10]}
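A sketch assuming the value embedding is a token-indexed table mixed into the attention values on layers 9 and 10; additive mixing and the table shape are assumptions.

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """Token-indexed embedding added to the attention value vectors of selected layers."""
    def __init__(self, vocab_size: int, dim: int = 128, layers=(9, 10)):
        super().__init__()
        self.layers = set(layers)
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, tokens: torch.Tensor, values: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # values: (batch, seq, dim) value vectors for the current layer.
        if layer_idx not in self.layers:
            return values
        return values + self.embed(tokens)               # additive mixing (illustrative)
```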
Quantization
GPTQ
bits: 6
scope: all
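GPTQ itself quantizes weights column by column with error compensation from second-order statistics; that procedure is not reproduced here. Purely to illustrate the 6-bit grid the weights land on, a per-output-channel round-to-nearest sketch (explicitly not GPTQ):

```python
import torch

def fake_quant_6bit(w: torch.Tensor) -> torch.Tensor:
    """Round a 2D weight matrix (out_features, in_features) to a symmetric 6-bit grid per row.
    RTN illustration only; GPTQ additionally compensates quantization error across columns."""
    qmax = 2 ** (6 - 1) - 1                              # 31 positive levels
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale
```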
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"adam_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
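The warmup parameters suggest the Muon momentum is ramped from 0.92 to 0.99 over the first 1500 steps; a sketch assuming a linear ramp (the shape is not stated in the record).

```python
def muon_momentum(step: int,
                  start: float = 0.92,
                  final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Ramp Muon momentum linearly from `start` to `final`, then hold it at `final`."""
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```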
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
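Both averages follow standard update rules; a sketch of the two is below. How the EMA and SWA weights are combined into the final artifact is not specified in the record.

```python
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.997):
    """Exponential moving average of weights: ema <- decay * ema + (1 - decay) * current."""
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p, alpha=1.0 - decay)

@torch.no_grad()
def update_swa(swa_state: dict, model, step: int, every: int = 50):
    """Equal-weight running average of checkpoints taken every `every` steps."""
    if step % every != 0:
        return
    swa_state["n"] = swa_state.get("n", 0) + 1
    n = swa_state["n"]
    for name, p in model.named_parameters():
        avg = swa_state.setdefault(name, torch.zeros_like(p))
        avg.add_((p - avg) / n)        # incremental mean
```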
Evaluation
sliding window eval
parameters: {"stride":64}
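A sketch of sliding-window evaluation with stride 64, assuming a `model(ids) -> logits` call, byte-level tokens (so nats convert directly to bits per byte), and a placeholder context length `window`; only the tokens newly exposed by each step are counted, so every target is scored exactly once.

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, window: int = 1024, stride: int = 64) -> float:
    """Bits-per-byte over a long 1D token sequence using a sliding context window."""
    n = tokens.numel()
    total_nll, counted = 0.0, 0
    pos = 1                                      # first target position to score
    while pos < n:
        end = min(pos + stride, n)               # targets scored this step: [pos, end)
        start = max(0, end - window)             # left edge of the context window
        ctx = tokens[start:end].unsqueeze(0)
        logits = model(ctx)                      # (1, T, vocab), assumed signature
        logp = torch.log_softmax(logits[0, :-1].float(), dim=-1)
        targets = ctx[0, 1:]
        nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        k = end - pos                            # number of new targets this step
        total_nll += nll[-k:].sum().item()
        counted += k
        pos = end
    return total_nll / counted / math.log(2)     # nats -> bits
```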
multi-order n-gram eval
parameters: {"orders":[2,7]}
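Interpreting `orders: [2,7]` as all orders from 2 through 7, a minimal longest-match lookup structure; the storage format and counting scheme are assumptions.

```python
from collections import defaultdict

class NGramCache:
    """Backoff n-gram tables for orders 2..7: each table maps an (order-1)-token context
    to next-token counts; `match()` returns the highest order whose context is present."""
    def __init__(self, orders=range(2, 8)):
        self.orders = sorted(orders, reverse=True)            # try the longest context first
        self.tables = {o: defaultdict(lambda: defaultdict(int)) for o in self.orders}

    def add(self, tokens):
        for o in self.orders:
            for i in range(len(tokens) - o + 1):
                ctx = tuple(tokens[i:i + o - 1])
                self.tables[o][ctx][tokens[i + o - 1]] += 1

    def match(self, history):
        """Return (order, next-token counts) for the longest matching context, else (None, None)."""
        for o in self.orders:
            ctx = tuple(history[-(o - 1):])
            if len(ctx) == o - 1 and ctx in self.tables[o]:
                return o, self.tables[o][ctx]
        return None, None
```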
Other
other
Order-adaptive entropy gating that sets entropy thresholds based on matched n-gram order during evaluation.
parameters: {"entropy_center":3,"slope":0.25,"min_order":2}
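The record does not give the threshold's functional form. A sketch assuming a linear schedule that equals `entropy_center` for `min_order` matches and decreases by `slope` per additional order (so higher-order matches are trusted at lower entropy, consistent with the contribution list below), with entropy measured in bits; both the sign convention and the gating direction are assumptions.

```python
import math

def entropy_threshold(order: int,
                      entropy_center: float = 3.0,
                      slope: float = 0.25,
                      min_order: int = 2) -> float:
    """Per-order entropy threshold, assumed linear in the matched n-gram order."""
    return entropy_center - slope * (order - min_order)

def use_ngram(model_probs, matched_order: int) -> bool:
    """Gate: prefer the n-gram prediction when the model's predictive entropy (bits)
    exceeds the order-dependent threshold."""
    entropy = -sum(p * math.log2(p) for p in model_probs if p > 0.0)
    return entropy > entropy_threshold(matched_order)
```

Under these parameters an order-2 match gates at 3.0 bits while an order-7 match gates at 1.75 bits.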
other
Score-first legality: the n-gram cache is updated only after each sliding-window batch has been scored, so the cache used to score a batch never includes that batch's own tokens.
parameters: null
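A sketch of the evaluation loop this implies; `score_fn` and the batch format are placeholders, and the cache could be the `NGramCache` sketched above.

```python
def evaluate_score_first(batches, score_fn, ngram_cache):
    """Score each sliding-window batch against the cache built from earlier batches only,
    then admit the batch's tokens into the cache."""
    total = 0.0
    for batch in batches:
        total += score_fn(batch, ngram_cache)   # 1) score against the frozen cache
        ngram_cache.add(batch)                  # 2) then update the cache with this batch
    return total
```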
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
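A sketch assuming the 1/sqrt(layer+1) factor multiplies the LayerNorm output; it could equally be folded into the LN gain parameter.

```python
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer+1), shrinking the contribution
    of deeper layers to the residual stream."""
    def __init__(self, dim: int, layer_idx: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = 1.0 / math.sqrt(layer_idx + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x) * self.scale
```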
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
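Assuming the warmdown is a linear decay to zero over the final 3500 steps (only the length is given in the record):

```python
def lr_multiplier(step: int, total_steps: int, warmdown_steps: int = 3500) -> float:
    """Constant LR until the final `warmdown_steps`, then a linear decay to zero."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return max(steps_left, 0) / warmdown_steps
```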
Compression
lzma
level: null
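Since the compression level is unspecified, a sketch using Python's lzma module with its default preset; the file paths are placeholders.

```python
import lzma

def compress_artifact(path_in: str, path_out: str) -> int:
    """LZMA-compress the serialized model artifact and return the compressed size in bytes."""
    with open(path_in, "rb") as f:
        data = f.read()
    blob = lzma.compress(data)        # default preset; the record leaves the level unspecified
    with open(path_out, "wb") as f:
        f.write(blob)
    return len(blob)
```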

Novel Contributions

  • Order-adaptive entropy gating with per-n-gram-order entropy thresholds
  • Extension of XSA from the last 4 layers to all 11 layers
  • Improved n-gram evaluation by trusting higher-order matches at lower entropy thresholds
  • Score-first n-gram cache legality during evaluation