| Field | Value |
|---|---|
| val_bpb | 1.5134 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact size | 14,941,134 bytes |
## Training Techniques

### Quantization
- int6 (bits: 6, scope: all)
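A minimal sketch of what 6-bit weight quantization can look like, assuming symmetric per-tensor scaling (the submission's exact scheme is not specified in the metadata):

```python
import numpy as np

def quantize_int6(w, bits=6):
    """Symmetric per-tensor quantization: map floats to signed integers
    in [-31, 31] with a single scale factor. Hypothetical sketch."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)                            # reconstruction
```

Six-bit integers do not pack into whole bytes, so a real artifact would also bit-pack four values into three bytes; that step is omitted here.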
### Weight Averaging
- EMA (decay: 0.997)
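Exponential moving averaging with decay 0.997 keeps a shadow copy of the parameters; a minimal sketch over a dict of NumPy arrays:

```python
import numpy as np

class EMA:
    """Exponential moving average of model parameters (decay 0.997
    per the metadata). Frameworks keep shadow copies the same way."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        for k, v in params.items():
            self.shadow[k] = self.decay * self.shadow[k] + (1.0 - self.decay) * v

params = {"w": np.zeros(3)}
ema = EMA(params, decay=0.997)
params["w"] += 1.0            # simulate one optimizer step
ema.update(params)            # shadow moves 0.3% of the way toward the new weights
```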
### Architecture
- Weight tying: tied input and output embeddings.
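Weight tying reuses the input embedding matrix as the output projection, halving the embedding parameter count; a toy sketch:

```python
import numpy as np

vocab, dim = 100, 16
E = np.random.randn(vocab, dim).astype(np.float32)  # one shared table

def embed(token_ids):
    return E[token_ids]          # input embedding lookup

def logits(h):
    return h @ E.T               # same matrix transposed: no separate head

h = embed(np.array([3, 7]))      # (2, dim)
out = logits(h)                  # (2, vocab)
```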
- U-Net skip connections: skip connections in the transformer stack.
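One common form of U-Net-style skips in a transformer stack, assumed here since the metadata does not specify the wiring, has the first half of the layers push activations onto a stack that the second half pops and adds back:

```python
import numpy as np

def unet_transformer(x, layers):
    """First half of the stack saves its outputs; second half adds them
    back in mirrored order. Each `layer` is a stand-in callable block."""
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i >= n // 2 and saved:
            x = x + saved.pop()          # skip from the mirrored early layer
        x = layer(x)
        if i < n // 2:
            saved.append(x)
    return x

blocks = [lambda x: x + 1.0 for _ in range(4)]   # toy "transformer blocks"
y = unet_transformer(np.zeros(2), blocks)
```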
- GQA: grouped query attention (layers: 10, model dimension: 512, query heads: 8, KV heads: 4).
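With 8 query heads and 4 KV heads, each key/value head is shared by 2 query heads, shrinking the KV projections and cache. A sketch with the listed shapes (causal masking omitted for brevity; this is not the submission's code):

```python
import numpy as np

def gqa(x, Wq, Wk, Wv, heads=8, kv_heads=4):
    """Grouped query attention: query heads share KV heads in groups of
    heads // kv_heads. No mask or output projection, for brevity."""
    T, d = x.shape
    hd = d // heads                                  # per-head dim (64)
    q = (x @ Wq).reshape(T, heads, hd)
    k = (x @ Wk).reshape(T, kv_heads, hd)
    v = (x @ Wv).reshape(T, kv_heads, hd)
    group = heads // kv_heads
    out = np.empty_like(q)
    for h in range(heads):
        kv = h // group                              # shared KV head index
        att = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        att = np.exp(att - att.max(-1, keepdims=True))
        att /= att.sum(-1, keepdims=True)            # softmax over keys
        out[:, h] = att @ v[:, kv]
    return out.reshape(T, d)

d, T = 512, 4
x = np.random.randn(T, d).astype(np.float32)
Wq = np.random.randn(d, d).astype(np.float32) * 0.02
Wk = np.random.randn(d, d // 2).astype(np.float32) * 0.02  # 4 heads * 64 = 256
Wv = np.random.randn(d, d // 2).astype(np.float32) * 0.02
y = gqa(x, Wq, Wk, Wv)
```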
- BigramHash: hash-table embedding for token bigrams (buckets: 2048, dimensions: 128).
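A hashed bigram embedding maps each (previous token, current token) pair into a fixed number of buckets, trading collisions for a small table. A sketch with the listed sizes; the mixing constant is illustrative, not the submission's hash:

```python
import numpy as np

BUCKETS, DIM = 2048, 128
table = np.random.randn(BUCKETS, DIM).astype(np.float32) * 0.02

def bigram_features(tokens):
    """Embed each token's (prev, cur) bigram via an integer hash.
    Token 0 has no predecessor, so it pairs with itself here."""
    prev = np.concatenate(([tokens[0]], tokens[:-1]))
    idx = (prev * 1000003 + tokens) % BUCKETS        # toy hash
    return table[idx]

feats = bigram_features(np.array([5, 17, 17, 9]))    # (4, 128)
```

Distinct bigrams can land in the same bucket; the "collision premium" quantified under Novel Contributions is the cost of exactly this sharing.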
- SmearGate: per-dimension gate blending each token with the previous token.
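One plausible reading of the smear gate, assumed here since the exact form is not given: a learned per-dimension sigmoid gate adds a fraction of the previous token's representation to the current one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, g):
    """out[t] = x[t] + sigmoid(g) * x[t-1], with g of shape (dim,).
    The first token has no predecessor and passes through unchanged."""
    prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])
    return x + sigmoid(g) * prev

x = np.arange(8.0).reshape(4, 2)   # 4 tokens, dim 2
g = np.zeros(2)                    # sigmoid(0) = 0.5 per dimension
y = smear_gate(x, g)
```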
- XSA: self-value bias removal on the last 4 layers.
- MLP3x: wider MLP with 3x expansion.
- GELU pre-enrichment: wider nonlinear pre-transformer enrichment block (input_dim: 512, hidden_dim: 768).
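The enrichment block expands embeddings 512 → 768 through a GELU and projects back before the transformer stack. A sketch; the residual connection is an assumption, as the metadata only gives the widths:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class Enrich:
    """Nonlinear enrichment applied to embeddings before the transformer:
    512 -> 768 -> 512 with a GELU in between, plus a residual (assumed)."""
    def __init__(self, d_in=512, d_hidden=768, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.02, (d_in, d_hidden)).astype(np.float32)
        self.W2 = rng.normal(0, 0.02, (d_hidden, d_in)).astype(np.float32)

    def __call__(self, x):
        return x + gelu(x @ self.W1) @ self.W2

block = Enrich()
out = block(np.random.randn(4, 512).astype(np.float32))
```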
### Compression
- lzma
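The artifact is LZMA-compressed before its size is counted. An illustration with the standard library (default preset assumed, since the metadata lists no level):

```python
import lzma
import pickle
import numpy as np

# Serialize some quantized weights and round-trip them through LZMA.
weights = {"embed": (np.random.randn(256, 64) * 8).astype(np.int8)}
raw = pickle.dumps(weights)
packed = lzma.compress(raw)                  # default preset
restored = pickle.loads(lzma.decompress(packed))
```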
### Evaluation
- Sliding-window evaluation.
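Sliding-window evaluation scores a long stream with a fixed-length context, advancing the window by a stride and counting only tokens not covered by the previous window, so most tokens keep long left context. A sketch with a toy scoring function (not the submission's `eval_val_sliding()`):

```python
import numpy as np

def sliding_bpb(nll_fn, tokens, window=8, stride=4):
    """nll_fn(ctx) returns per-token costs in bits for a context.
    Average cost per token over the stream, scoring each token once."""
    n = len(tokens)
    total, count, prev_end = 0.0, 0, 0
    for begin in range(0, n, stride):
        end = min(begin + window, n)
        nll = nll_fn(tokens[begin:end])
        new = end - prev_end                 # tokens not yet scored
        total += nll[-new:].sum()
        count += new
        prev_end = end
        if end == n:
            break
    return total / count

# Toy model: every token costs exactly 2 bits.
tokens = np.arange(10)
bpb = sliding_bpb(lambda ctx: np.full(len(ctx), 2.0), tokens)
```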
### Regularization
- Weight decay.
### Sequence Length
- train_length: 2048
- eval_length: 2048
## Novel Contributions
- Fixes full-vocabulary normalization in `eval_val_sliding()` by dividing by the summed hashed-vocabulary probability mass instead of `ctx_count + beta`.
- Provides an honest rerun showing that the correctly normalized n-gram path degrades to 1.51343368 BPB and loses to the neural sliding-window baseline.
- Updates the submission README and metadata to retract the earlier, incorrect 0.3922 BPB claim.
- Demonstrates that the previously reported gain was due to an unnormalized denominator rather than a true full-vocabulary posterior.
- Quantifies the collision premium and compares normalized n-gram scoring against the neural sliding-window baseline.
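The normalization bug described above can be illustrated on toy counts (hypothetical numbers; the real `eval_val_sliding()` operates on hashed n-gram tables):

```python
import numpy as np

counts = np.array([3.0, 1.0, 0.0, 0.0])  # per-vocab-item counts in one context
beta = 0.5                                # smoothing constant
ctx_count = counts.sum()                  # 4.0

smoothed = counts + beta

# Buggy: dividing by ctx_count + beta makes the "probabilities" sum to
# more than 1, which deflates the measured bits per byte.
p_bad = smoothed / (ctx_count + beta)

# Fixed: normalize by the summed smoothed mass over the whole vocabulary,
# yielding a true posterior that sums to 1.
p_good = smoothed / smoothed.sum()
```

Under the buggy denominator every token looks more probable than it is, so the reported BPB drops; renormalizing restores honest (higher) costs.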