PR #978

open

Review: Rerun of #972 with actual full-vocab normalization

by AnirudhRahul
val_bpb: 1.5134
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,941,134 bytes

Training Techniques

Quantization
int6 (bits: 6, scope: all)
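A minimal sketch of what "int6, scope: all" could look like, assuming symmetric per-tensor quantization to signed 6-bit levels (-31..31); the PR's actual scheme (per-channel scales, rounding mode) is not shown here.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    # Symmetric per-tensor quantization: map max |w| to level 31.
    scale = float(np.abs(w).max()) / 31.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct approximate float weights from 6-bit codes.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.031], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

With "scope: all", every weight tensor in the artifact would pass through this path, which is what makes the 14.9 MB artifact size plausible for the parameter count.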
Weight Averaging
EMA (decay: 0.997)
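EMA weight averaging with the listed decay can be sketched as below; real integrations hook `update()` into the training loop and swap the shadow weights in for evaluation.

```python
import numpy as np

class EMA:
    """Exponential moving average of parameters (decay 0.997, per the PR)."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        # Shadow copy that trails the live weights.
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # Move 1 - decay of the way toward the current weights.
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v

params = {"w": np.zeros(3)}
ema = EMA(params, decay=0.997)
ema.update({"w": np.ones(3)})   # shadow moves 0.3% toward the new weights
```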
Architecture
weight tying: tied input and output embeddings
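Weight tying means one matrix serves as both the input embedding table and the output projection, roughly as in this sketch (dims shrunk for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 16
W = rng.standard_normal((vocab, dim)).astype(np.float32)  # one shared table

def embed(tokens):
    # Input side: row lookup in the shared matrix.
    return W[tokens]

def unembed(h):
    # Output side: logits via the same matrix, transposed.
    return h @ W.T
```

Tying halves the embedding parameter count, which matters for an artifact-size-constrained submission.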
U-Net skip connections: skip connections in the transformer stack
GQA: grouped query attention (layers: 10, dimensions: 512, heads: 8, kv_heads: 4)
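With 8 query heads and 4 KV heads, each K/V head serves a group of 2 query heads. A single-sequence sketch (model dim shrunk from 512 for brevity; not the PR's code):

```python
import numpy as np

def gqa(x, Wq, Wk, Wv, n_heads=8, n_kv_heads=4):
    T, D = x.shape
    hd = D // n_heads
    q = (x @ Wq).reshape(T, n_heads, hd)
    k = (x @ Wk).reshape(T, n_kv_heads, hd)   # Wk/Wv project to half the dims
    v = (x @ Wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)           # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask[None], scores, -np.inf)  # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hts,shd->thd', w, v).reshape(T, D)

rng = np.random.default_rng(0)
T, D = 5, 16
x = rng.standard_normal((T, D))
y = gqa(x, rng.standard_normal((D, D)),
        rng.standard_normal((D, D // 2)),
        rng.standard_normal((D, D // 2)))
```

Halving the KV heads halves the K/V projection parameters and cache, a common size/quality trade at this scale.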
BigramHash: hash-table embedding for token bigrams (buckets: 2048, dimensions: 128)
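A hashed bigram embedding maps each (previous, current) token pair into a fixed bucket table. The mixing hash below is hypothetical; only the bucket and dimension counts come from the listing.

```python
import numpy as np

BUCKETS, DIM = 2048, 128   # from the listed parameters

rng = np.random.default_rng(0)
table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Hypothetical integer mixing hash; the PR's actual hash is not shown.
    h = (prev_tok * 1000003 + cur_tok) & 0xFFFFFFFF
    h ^= h >> 16
    return h % BUCKETS

def bigram_embed(tokens):
    # Pad the front with token 0 so position 0 has a "previous" token.
    prev = [0] + list(tokens[:-1])
    idx = [bigram_bucket(p, c) for p, c in zip(prev, tokens)]
    return table[idx]            # (len(tokens), DIM)
```

Collisions are the price of the fixed 2048-bucket table, which connects to the "collision premium" quantified under Novel Contributions.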
SmearGate: per-dimension gate blending each token with the previous token
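One plausible reading of the SmearGate description is y_t = g * x_t + (1 - g) * x_{t-1} with a learned per-dimension sigmoid gate g; the exact parameterization in the PR may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, g_logits):
    """Blend each token with the previous one through a per-dimension gate.

    x: (seq_len, dim) activations; g_logits: (dim,) learned gate logits.
    The blend equation is an assumption consistent with the one-line
    description above.
    """
    g = sigmoid(g_logits)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]               # first token has no predecessor
    return g * x + (1.0 - g) * prev
```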
XSA: self-value bias removal on the last 4 layers (layers: 4)
MLP3x: wider MLP with 3x expansion
GELU pre-enrichment: wider nonlinear pre-transformer enrichment block (input_dim: 512, hidden_dim: 768)
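The pre-enrichment block is presumably a 512 -> 768 -> 512 GELU MLP applied to the embeddings before the first transformer layer; only the dims come from the listing, and the residual add is an assumption.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pre_enrich(x, W1, b1, W2, b2):
    # 512 -> 768 -> 512 nonlinear block with an (assumed) residual connection.
    return x + gelu(x @ W1 + b1) @ W2 + b2

d_in, d_h = 512, 768
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_in))
y = pre_enrich(x,
               rng.standard_normal((d_in, d_h)) * 0.02, np.zeros(d_h),
               rng.standard_normal((d_h, d_in)) * 0.02, np.zeros(d_in))
```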
Compression
lzma (level: null)
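Artifact compression with the stdlib `lzma` module at its default preset (a `level` of null presumably means the library default) looks like:

```python
import lzma

payload = b"model weights " * 4096   # stand-in for the serialized artifact
blob = lzma.compress(payload)        # no preset argument => library default
restored = lzma.decompress(blob)     # round-trips to the original bytes
```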
Evaluation
sliding window eval
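Sliding-window evaluation scores a long sequence in strides, giving each scored chunk left context from the previous window. A sketch, with the stride and the bits-per-byte conversion as assumptions (`nll_fn` stands in for the model):

```python
import numpy as np

def sliding_window_bpt(nll_fn, tokens, window=2048, stride=1024):
    """Average per-token bits using overlapping windows of length `window`.

    Each chunk of up to `stride` new tokens is scored with left context,
    and every token is counted exactly once. `nll_fn(ctx)` returns
    per-token negative log-likelihoods in nats for a window.
    """
    total_nats, n = 0.0, 0
    for pos in range(0, len(tokens), stride):
        end = min(pos + stride, len(tokens))
        ctx = tokens[max(0, end - window):end]
        nll = nll_fn(ctx)
        total_nats += float(np.sum(nll[-(end - pos):]))  # score only the new tokens
        n += end - pos
    return total_nats / n / np.log(2)  # bits per token; divide by bytes/token for BPB
```

With the listed eval_length of 2048, a model assigning every token probability 1/2 would score exactly 1.0 bit per token under this scheme.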
Regularization
weight decay
Sequence Length
train_length: 2048, eval_length: 2048

Novel Contributions

  • Fixes full-vocab normalization in eval_val_sliding() by dividing by summed hashed-vocab mass instead of ctx_count + beta.
  • Provides an honest rerun showing the normalized n-gram path degrades to 1.51343368 BPB and loses to the neural sliding-window baseline.
  • Updates the submission README and metadata to retract the earlier incorrect 0.3922 claim.
  • Demonstrates that the previously reported gain was due to an unnormalized denominator rather than a true full-vocab posterior.
  • Quantifies the collision premium and compares normalized n-gram scoring against the neural sliding-window baseline.
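The denominator fix in the first bullet can be illustrated with a toy add-beta model. The names below are hypothetical (this is not the PR's eval_val_sliding()); the point is that dividing by ctx_count + beta instead of the summed smoothed mass yields "probabilities" totaling more than 1, which understates BPB.

```python
import numpy as np

def posterior(counts, beta):
    # Correct add-beta posterior: normalize by the summed hashed-vocab mass.
    return (counts + beta) / (counts.sum() + beta * len(counts))

def buggy_posterior(counts, beta):
    # The retracted variant: divides by ctx_count + beta only.
    return (counts + beta) / (counts.sum() + beta)

counts = np.array([5.0, 2.0, 0.0, 1.0])  # hashed-vocab counts for one context
beta = 0.5

p_ok = posterior(counts, beta)           # sums to exactly 1
p_bad = buggy_posterior(counts, beta)    # sums past 1, inflating every p(token)
```

Because every buggy probability is inflated, each token's -log2 p is too small, which is how the unnormalized path produced the retracted 0.3922 figure instead of the honest 1.5134.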