PR #978

open

Review: Rerun of #972 with actual full-vocab normalization

by AnirudhRahul
val_bpb: 1.5134
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,941,134 bytes

Training Techniques

Quantization
int6 (bits: 6, scope: all)
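A minimal sketch of what "int6, scope: all" could look like, assuming symmetric per-tensor quantization to signed 6-bit levels (-31..31); the PR's actual scheme (per-channel scales, rounding mode) is not shown here.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    # Symmetric per-tensor quantization: map max |w| to level 31.
    scale = float(np.abs(w).max()) / 31.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct approximate float weights from 6-bit codes.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.031], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

With "scope: all", every weight tensor in the artifact would pass through this path, which is what makes the 14.9 MB artifact size plausible for the parameter count.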
Weight Averaging
EMA (decay: 0.997)
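EMA weight averaging with the listed decay can be sketched as below; real integrations hook `update()` into the training loop and swap the shadow weights in for evaluation.

```python
import numpy as np

class EMA:
    """Exponential moving average of parameters (decay 0.997, per the PR)."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        # Shadow copy that trails the live weights.
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # Move 1 - decay of the way toward the current weights.
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v

params = {"w": np.zeros(3)}
ema = EMA(params, decay=0.997)
ema.update({"w": np.ones(3)})   # shadow moves 0.3% toward the new weights
```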
Architecture
weight tying: tied input and output embeddings
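Weight tying means one matrix serves as both the input embedding table and the output projection, roughly as in this sketch (dims shrunk for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 16
W = rng.standard_normal((vocab, dim)).astype(np.float32)  # one shared table

def embed(tokens):
    # Input side: row lookup in the shared matrix.
    return W[tokens]

def unembed(h):
    # Output side: logits via the same matrix, transposed.
    return h @ W.T
```

Tying halves the embedding parameter count, which matters for an artifact-size-constrained submission.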
U-Net skip connections: skip connections in the transformer stack
GQA: grouped query attention (layers: 10, dimensions: 512, heads: 8, kv_heads: 4)
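With 8 query heads and 4 KV heads, each K/V head serves a group of 2 query heads. A single-sequence sketch (model dim shrunk from 512 for brevity; not the PR's code):

```python
import numpy as np

def gqa(x, Wq, Wk, Wv, n_heads=8, n_kv_heads=4):
    T, D = x.shape
    hd = D // n_heads
    q = (x @ Wq).reshape(T, n_heads, hd)
    k = (x @ Wk).reshape(T, n_kv_heads, hd)   # Wk/Wv project to half the dims
    v = (x @ Wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)           # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask[None], scores, -np.inf)  # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v).reshape(T, D)

rng = np.random.default_rng(0)
T, D = 5, 16
x = rng.standard_normal((T, D))
y = gqa(x, rng.standard_normal((D, D)),
        rng.standard_normal((D, D // 2)),
        rng.standard_normal((D, D // 2)))
```

Halving the KV heads halves the K/V projection parameters and cache, a common size/quality trade at this scale.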
BigramHash: hash-table embedding for token bigrams (buckets: 2048, dimensions: 128)
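A hashed bigram embedding maps each (previous, current) token pair into a fixed bucket table. The mixing hash below is hypothetical; only the bucket and dimension counts come from the listing.

```python
import numpy as np

BUCKETS, DIM = 2048, 128   # from the listed parameters

rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Hypothetical integer mixing hash; the PR's actual hash is not shown.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16
    return h % BUCKETS

def bigram_embed(tokens):
    # Pad the front with token 0 so position 0 has a "previous" token.
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return table[idx]            # (len(tokens), DIM)
```

Collisions are the price of the fixed 2048-bucket table, which connects to the "collision premium" quantified under Novel Contributions.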
SmearGate: per-dimension gate blending each token with the previous token
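One plausible reading of the SmearGate description is y_t = g * x_t + (1 - g) * x_{t-1} with a learned per-dimension sigmoid gate g; the exact parameterization in the PR may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, g_logits):
    """Blend each token with the previous one through a per-dimension gate.

    x: (seq_len, dim) activations; g_logits: (dim,) learned gate logits.
    The blend equation is an assumption consistent with the one-line
    description above.
    """
    g = sigmoid(g_logits)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]               # first token has no predecessor
    return g * x + (1.0 - g) * prev
```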
XSA: self-value bias removal on the last 4 layers (layers: 4)
MLP3x: wider MLP with 3x expansion
GELU pre-enrichment: wider nonlinear pre-transformer enrichment block (input_dim: 512, hidden_dim: 768)
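The pre-enrichment block is presumably a 512 -> 768 -> 512 GELU MLP applied to the embeddings before the first transformer layer; only the dims come from the listing, and the residual add is an assumption.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pre_enrich(x, W1, b1, W2, b2):
    # 512 -> 768 -> 512 nonlinear block with an (assumed) residual connection.
    return x + gelu(x @ W1 + b1) @ W2 + b2

d_in, d_h = 512, 768
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_in))
y = pre_enrich(x,
               rng.standard_normal((d_in, d_h)) * 0.02, np.zeros(d_h),
               rng.standard_normal((d_h, d_in)) * 0.02, np.zeros(d_in))
```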
Compression
lzma (level: null)
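Artifact compression with the stdlib `lzma` module at its default preset (a `level` of null presumably means the library default) looks like:

```python
import lzma

payload = b"model weights " * 4096   # stand-in for the serialized artifact
blob = lzma.compress(payload)        # no preset argument => library default
restored = lzma.decompress(blob)     # round-trips to the original bytes
```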
Evaluation
sliding window eval
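Sliding-window evaluation scores a long sequence in strides, giving each scored chunk left context from the previous window. A sketch, with the stride and the bits-per-byte conversion as assumptions (`nll_fn` stands in for the model):

```python
import numpy as np

def sliding_window_bpt(nll_fn, tokens, window=2048, stride=1024):
    """Average per-token bits using overlapping windows of length `window`.

    Each chunk of up to `stride` new tokens is scored with left context,
    and every token is counted exactly once. `nll_fn(ctx)` returns
    per-token negative log-likelihoods in nats for a window.
    """
    total_nats, n = 0.0, 0
    for pos in range(0, len(tokens), stride):
        end = min(pos + stride, len(tokens))
        ctx = tokens[max(0, end - window):end]
        nll = nll_fn(ctx)
        total_nats += float(np.sum(nll[-(end - pos):]))  # score only the new tokens
        n += end - pos
    return total_nats / n / np.log(2)  # bits per token; divide by bytes/token for BPB
```

With the listed eval_length of 2048, a model assigning every token probability 1/2 would score exactly 1.0 bit per token under this scheme.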
Regularization
weight decay
Sequence Length
train_length: 2048, eval_length: 2048

Novel Contributions

  • Fixes full-vocab normalization in eval_val_sliding() by dividing by summed hashed-vocab mass instead of ctx_count + beta.
  • Provides an honest rerun showing the normalized n-gram path degrades to 1.51343368 BPB and loses to the neural sliding-window baseline.
  • Updates the submission README and metadata to retract the earlier incorrect 0.3922 claim.
  • Demonstrates that the previously reported gain was due to an unnormalized denominator rather than a true full-vocab posterior.
  • Quantifies the collision premium and compares normalized n-gram scoring against the neural sliding-window baseline.
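The denominator fix in the first bullet can be illustrated with a toy add-beta model. The names below are hypothetical (this is not the PR's eval_val_sliding()); the point is that dividing by ctx_count + beta instead of the summed smoothed mass yields "probabilities" totaling more than 1, which understates BPB.

```python
import numpy as np

def posterior(counts, beta):
    # Correct add-beta posterior: normalize by the summed hashed-vocab mass.
    return (counts + beta) / (counts.sum() + beta * len(counts))

def buggy_posterior(counts, beta):
    # The retracted variant: divides by ctx_count + beta only.
    return (counts + beta) / (counts.sum() + beta)

counts = np.array([5.0, 2.0, 0.0, 1.0])  # hashed-vocab counts for one context
beta = 0.5

p_ok = posterior(counts, beta)           # sums to exactly 1
p_bad = buggy_posterior(counts, beta)    # sums past 1, inflating every p(token)
```

Because every buggy probability is inflated, each token's -log2 p is too small, which is how the unnormalized path produced the retracted 0.3922 figure instead of the honest 1.5134.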