PR #1060

open

Record: 1.1123 BPB — Coprime-Stride Loader + Full GPTQ + XSA-all (3-seed mean)

by dexhunter
val_bpb
1.1123
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.99 MB

Training Techniques

Architecture
BigramHash
Bigram hash embedding, enlarged relative to the prior scaffold to capture more bigram patterns.
parameters: {"vocab_size":2816,"dimensions":112}
XSA
Exclusive Self-Attention applied to all layers instead of only the last few layers.
parameters: {"layers":11}
LeakyReLU
The base scaffold's MLP uses a squared LeakyReLU activation.
parameters: {"layers":11}
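
As a rough illustration of the BigramHash entry above, the sketch below maps each (previous token, current token) pair to one of 2816 rows of a 112-dimensional table. Only the table shape comes from the listed parameters. The hash function, the padding id for the first position, the initialisation, and all function names are hypothetical and are not taken from the PR.

```python
import random

BUCKETS = 2816  # "vocab_size" from the parameters above
DIM = 112       # "dimensions" from the parameters above


def bigram_bucket(prev_token: int, token: int, buckets: int = BUCKETS) -> int:
    """Hypothetical hash: mix the two token ids and fold into the table size."""
    return (prev_token * 1_000_003 + token) % buckets


def make_table(buckets: int = BUCKETS, dim: int = DIM, seed: int = 0):
    """Small random table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(buckets)]


def bigram_features(tokens, table):
    """Return one table row per position, keyed on (previous, current) token."""
    rows = []
    prev = 0  # assumed padding id for the first position
    for tok in tokens:
        rows.append(table[bigram_bucket(prev, tok, len(table))])
        prev = tok
    return rows
```

Repeated bigrams hit the same row, so in `bigram_features([5, 9, 5, 9], table)` the second and fourth positions share one embedding. Distinct bigrams can also collide, which is the usual trade-off of a hashed table against a full vocabulary-squared one.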
Quantization
GPTQ
bits: 6
scope: all
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
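
A minimal sketch of sliding window evaluation with the listed stride of 64 follows. The context length of 1024 and the function names are assumptions. Each window advances by `stride` tokens, and only the tokens not already scored by an earlier window count toward the loss, so every token is scored exactly once with as much left context as the window allows. `bits_per_byte` converts a summed negative log-likelihood in nats into the bits-per-byte metric.

```python
import math


def sliding_windows(n_tokens: int, context: int = 1024, stride: int = 64):
    """Yield (begin, end, first_scored) triples.

    The model sees tokens[begin:end]; the loss is accumulated only over
    positions first_scored .. end-1, which no earlier window has scored.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)
```

For example, `list(sliding_windows(200, context=128, stride=64))` gives three windows that together score all 200 tokens once.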
Other
other
Coprime-stride multi-shard data pipeline that samples blocks from multiple shards with coprime strides to increase batch diversity.
parameters: {"shards":"multi-shard","stride_scheme":"coprime"}
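
The idea behind a coprime stride can be sketched as follows: if a shard holds n blocks and the stride s satisfies gcd(s, n) = 1, then the sequence (offset + i*s) mod n visits every block exactly once before repeating, in a scattered order. The round-robin interleaving across shards, the per-shard stride targets, and all names below are assumptions for illustration, not the PR's loader.

```python
from math import gcd


def coprime_stride(n_blocks: int, target: int) -> int:
    """Smallest stride >= target that is coprime with n_blocks."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    stride = max(1, target)
    while gcd(stride, n_blocks) != 1:
        stride += 1
    return stride


def block_order(n_blocks: int, stride: int, offset: int = 0):
    """Visit every block once by stepping with a stride coprime to n_blocks."""
    return [(offset + i * stride) % n_blocks for i in range(n_blocks)]


def interleave_shards(shard_sizes, target_stride: int = 7):
    """Yield (shard, block) pairs round-robin, one coprime walk per shard."""
    orders = [
        block_order(n, coprime_stride(n, target_stride + k), offset=k)
        for k, n in enumerate(shard_sizes)
    ]
    longest = max(len(order) for order in orders)
    for step in range(longest):
        for shard, order in enumerate(orders):
            if step < len(order):
                yield shard, order[step]
```

With 10 blocks and a target of 4, `coprime_stride` returns 7 and `block_order(10, 7)` is `[0, 7, 4, 1, 8, 5, 2, 9, 6, 3]`: full coverage, with consecutive samples drawn from distant parts of the shard.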
Regularization
LN scale
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
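
For reference, an exponential moving average with the listed decay of 0.997 keeps a shadow copy of the weights and blends in 0.3% of the current weights at every step. The sketch below uses plain floats in a dict as a stand-in for tensors; the function name and the dict layout are assumptions.

```python
def ema_update(ema: dict, params: dict, decay: float = 0.997) -> dict:
    """In place: ema[name] <- decay * ema[name] + (1 - decay) * params[name]."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema
```

A typical loop would call `ema_update(ema, current_params)` after each optimizer step and evaluate or export the `ema` copy rather than the raw weights.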

Novel Contributions

  • Coprime-stride multi-shard data pipeline
  • Full Hessian GPTQ with Cholesky error compensation
  • XSA extended to all 11 layers
  • BigramHash enlarged to 2816x112
  • No test-time training (TTT); sliding window evaluation outperformed it
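
The "full Hessian GPTQ with Cholesky error compensation" item can be illustrated with the sketch below, which follows the generally published GPTQ recipe rather than this PR's code. It builds the layer Hessian from calibration inputs, takes the upper Cholesky factor of its inverse, and quantizes one input column at a time while pushing each column's rounding error onto the columns not yet quantized. The per-row symmetric scale, the damping value, the use of NumPy, and the function names are assumptions; only the 6-bit width comes from the listing.

```python
import numpy as np


def quantize_rtn(w, scale, bits=6):
    """Round-to-nearest onto a symmetric signed grid, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def gptq_quantize(W, X, bits=6, damp=0.01):
    """W: (out, in) weights. X: (samples, in) calibration inputs."""
    W = np.asarray(W, dtype=np.float64).copy()
    X = np.asarray(X, dtype=np.float64)
    n_in = W.shape[1]

    # Full Hessian of the layer-wise squared error (up to a constant factor).
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)  # damping (assumed value)

    # Upper Cholesky factor U of the inverse Hessian: inv(H) = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax  # per-row scale (assumption)
    scale = np.where(scale == 0, 1.0, scale)

    Q = np.zeros_like(W)
    for i in range(n_in):
        Q[:, i] = quantize_rtn(W[:, i], scale, bits)
        # Spread this column's error over the columns still to be quantized.
        err = (W[:, i] - Q[:, i]) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q, scale
```

`Q` holds the dequantized weights and `Q / scale[:, None]` recovers the integer codes in the range -32 to 31. Comparing `np.linalg.norm(X @ (W - Q).T)` against plain round-to-nearest on the same inputs is a simple way to check how much the error compensation helps on a given layer.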