PR #1099
openRecord: Coprime-Stride Loader + Full GPTQ + XSA-all — val_bpb 1.1133 (3-seed mean)
by Bortlesboat
val_bpb
1.1133
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.89 MB
Training Techniques
Architecture
BigramHash
Bigram hash embedding component used in the model stack.
parameters: {"vocab_size":2816,"dimension":112}
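A minimal sketch of what a bigram hash embedding with these parameters could look like: the (previous, current) token pair is hashed into a 2816-entry table of 112-dim vectors. The hash multiplier and the zero-padding of the first position are assumptions, not taken from the record.

```python
import numpy as np

VOCAB_HASH = 2816   # hashed bigram table size (from the record's parameters)
DIM = 112           # embedding dimension (from the record's parameters)

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, size=(VOCAB_HASH, DIM)).astype(np.float32)

def bigram_hash(prev_ids: np.ndarray, cur_ids: np.ndarray) -> np.ndarray:
    # Mix the (previous, current) token pair into one bucket index.
    # The odd multiplier is an arbitrary common hashing choice (assumed).
    mixed = prev_ids.astype(np.int64) * 1000003 + cur_ids.astype(np.int64)
    return mixed % VOCAB_HASH

def bigram_embed(token_ids: np.ndarray) -> np.ndarray:
    # Shift by one so position t sees the (t-1, t) bigram; pad the first slot.
    prev_ids = np.concatenate([[0], token_ids[:-1]])
    return bigram_table[bigram_hash(prev_ids, token_ids)]
```

The table output would typically be added to (or gated into) the ordinary token embedding.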
SmearGate
SmearGate module included in the architecture.
parameters: null
Partial RoPE
Rotary positional encoding applied partially to a subset of dimensions.
parameters: {"dimensions":16}
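Partial RoPE with 16 rotated dimensions can be sketched as follows: only the leading 16 dims of each head receive the rotary transform, and the rest pass through unchanged. The frequency base and paired-halves layout are standard RoPE conventions assumed here.

```python
import numpy as np

ROPE_DIMS = 16  # only this many leading dims per head are rotated (from the record)

def partial_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x: (seq, head_dim). Rotate the first ROPE_DIMS dims; pass the rest through."""
    half = ROPE_DIMS // 2
    freqs = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * freqs[None, :]    # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:ROPE_DIMS]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, ROPE_DIMS:]], axis=-1)
```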
XSA
XSA applied across all layers.
parameters: {"layers":11}
LeakyReLU
MLP uses LeakyReLU squared activation.
parameters: null
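One plausible reading of "LeakyReLU squared" (by analogy to the ReLU² activation common in speedrun MLPs) is LeakyReLU followed by squaring; the negative slope value is an assumption, as the record lists no parameters.

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, slope: float = 0.01) -> np.ndarray:
    # LeakyReLU followed by squaring; slope 0.01 is assumed, not from the record.
    y = np.where(x > 0, x, slope * x)
    return y * y
```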
Quantization
GPTQ
bits: 6
scope: all
Compression
lzma
level: null
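Since the record leaves the lzma level unspecified, here is a generic sketch of the compression step: quantized integer weights serialized to bytes and packed with Python's `lzma` module (the preset shown is an assumption).

```python
import lzma

import numpy as np

# Hypothetical artifact: quantized weights stored as int8 pack well under lzma.
weights = ((np.arange(20000) * 7919) % 64 - 32).astype(np.int8)
raw = weights.tobytes()
packed = lzma.compress(raw, preset=6)   # preset assumed; record lists level: null
restored = np.frombuffer(lzma.decompress(packed), dtype=np.int8)
```

Round-tripping through `lzma.decompress` recovers the exact bytes, so the artifact size bound comes purely from the entropy of the quantized weights.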
Weight Averaging
EMA
parameters: {"decay":0.997}
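The EMA update with the record's decay of 0.997, sketched over a plain parameter dict (the real implementation would operate on model tensors):

```python
def ema_update(ema_params: dict, params: dict, decay: float = 0.997) -> dict:
    # Exponential moving average of weights; decay 0.997 is from the record.
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```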
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Evaluation
sliding window eval
parameters: {"stride":64}
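Sliding-window evaluation with stride 64 presumably means each window is scored only on the tokens not yet covered, so every token past the first window is predicted with near-full context. A sketch of the window bookkeeping (context length here is illustrative):

```python
def sliding_windows(n_tokens: int, context: int, stride: int = 64):
    """Yield (begin, end, score_from) triples: tokens in [score_from, end)
    are scored; tokens in [begin, score_from) serve only as context."""
    out = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        out.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return out
```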
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
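A warmdown schedule with `warmdown_iters: 4000` could look like the following: constant learning rate, then a linear decay to zero over the final 4000 steps. The flat early phase is an assumption; the record only names the warmdown.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_iters: int = 4000) -> float:
    # Constant LR, then linear warmdown to zero over the last warmdown_iters steps.
    if step < total_steps - warmdown_iters:
        return 1.0
    return (total_steps - step) / warmdown_iters
```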
Regularization
LN scale
parameters: null
Other
other
Coprime-stride multi-shard data pipeline for training data loading.
parameters: null
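The coprime-stride idea can be sketched as follows: walking a shard (or a set of shards) with a stride coprime to its length forms a permutation, so every item is visited exactly once per epoch while consecutive reads jump across the data. The helper name and epoch framing are assumptions about the pipeline's structure.

```python
import math

def coprime_stride_order(n_items: int, stride: int) -> list:
    """Visit indices 0..n_items-1 in steps of `stride`. When
    gcd(stride, n_items) == 1, the walk is a full permutation."""
    assert math.gcd(stride, n_items) == 1, "stride must be coprime to n_items"
    return [(i * stride) % n_items for i in range(n_items)]
```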
Novel Contributions
- Coprime-stride multi-shard data pipeline
- Full Hessian GPTQ with Cholesky error compensation
- XSA on all 11 layers
- GPTQ reserve optimization to recover training steps
- FA3/FA2 graceful fallback for flash attention imports
- EMA-based final submission after FP32 SWA experimentation
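The FA3/FA2 graceful fallback presumably amounts to trying the FlashAttention imports in order of preference and dropping to a standard attention path when neither is available. A sketch, assuming the flash-attn project's usual module names (`flash_attn_interface` for FA3 builds, `flash_attn` for FA2); the fallback label stands in for whatever dense attention the model otherwise uses.

```python
import importlib

def pick_attention_backend() -> str:
    """Return the best available attention backend label."""
    for module_name, label in (("flash_attn_interface", "fa3"), ("flash_attn", "fa2")):
        try:
            importlib.import_module(module_name)
            return label
        except ImportError:
            continue
    return "fallback"  # e.g. PyTorch SDPA or a naive attention path
```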