PR #1019
Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean)
by abaybektursun
val_bpb: 1.1147
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.91 MB
Training Techniques
Quantization
- GPTQ (bits: 6, scope: all)
- late QAT (bits: null, scope: all)
Architecture
- BigramHash: bigram-hash embedding with a wider vocabulary/dimension setting (vocab_size: 3072, dim: 112)
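A minimal sketch of a bigram-hash embedding at the record's 3072×112 size: each (previous, current) token pair is hashed into a small auxiliary table. The multiplicative hash below is a placeholder; the record does not specify the exact hash function.

```python
import numpy as np

def bigram_hash_embed(tokens, table):
    """Extra embedding looked up by hashing each (prev, cur) token
    bigram into a small table (hash function is illustrative only)."""
    tokens = np.asarray(tokens)
    prev = np.concatenate(([0], tokens[:-1]))          # pad first position
    idx = (prev * 1000003 + tokens) % table.shape[0]   # bigram -> table row
    return table[idx]                                  # (seq_len, 112)

rng = np.random.default_rng(0)
table = rng.standard_normal((3072, 112)).astype(np.float32)  # 3072 x 112
emb = bigram_hash_embed([5, 17, 42, 17, 42], table)
```

Identical bigrams map to identical rows, so the table captures local pair statistics the main token embedding cannot.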
- XSA: cross-position attention applied to all layers (layers: 11)
- RoPE: partial rotary position embeddings (16 of 64 dimensions rotated)
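Partial RoPE as listed above rotates only 16 of the 64 per-head dimensions and passes the rest through unchanged. A sketch, assuming the rotated slice is the leading dimensions (the record does not say which slice):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` of each
    head's dims; the remaining dims are left untouched."""
    seq = x.shape[0]
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), inv_freq)           # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=1)

q = np.random.default_rng(1).standard_normal((8, 64))
q_rot = partial_rope(q)
```

Position 0 gets zero rotation, and dims 16..63 are identical before and after, which is the point of the "partial" variant.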
- VE128: applied to later layers (layers: [9, 10])
- SmearGate: position-mixing gate
- U-Net skip connections: encoder-decoder skip connections
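One plausible shape for U-Net-style skips in a layer stack: the first half of layers save their outputs, and each mirrored layer in the second half adds the matching skip back in before running. The exact wiring is an assumption; the record lists no parameters.

```python
def unet_stack(x, layers):
    """Run a layer stack with U-Net skips: encoder-half outputs are
    stored and added back to the decoder-half inputs in reverse order."""
    half = len(layers) // 2
    skips = []
    for f in layers[:half]:          # "encoder" half
        x = f(x)
        skips.append(x)
    for f in layers[half:]:          # "decoder" half, reversed skips
        x = f(x + skips.pop())
    return x

# Toy layers (v -> v + 1) just to trace the data flow.
out = unet_stack(0.0, [lambda v: v + 1] * 4)
```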
- LeakyReLU: squared LeakyReLU MLP activation (squared: true)
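A sketch of the squared LeakyReLU activation listed above: LeakyReLU first, then square. Keeping the sign on the negative branch is an assumption; plain relu(x)**2 variants drop it entirely.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.01):
    """Squared LeakyReLU: apply LeakyReLU, then square while
    preserving the sign (sign preservation is assumed, not confirmed)."""
    y = np.where(x >= 0, x, slope * x)
    return np.sign(y) * y * y

a = leaky_relu_squared(np.array([-2.0, 0.0, 3.0]))
```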
Optimizer
- Parallel Muon (weight_decay: null, momentum: null, other_params: null)
Weight Averaging
- EMA + SWA (ema_decay: 0.997, swa_every: 50)
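One plausible composition of the two averaging schemes at the listed settings: maintain an EMA of the weights every step and fold an SWA snapshot of that EMA every 50 steps. How the record actually combines EMA and SWA is not specified.

```python
import numpy as np

def averaged_weights(weight_stream, ema_decay=0.997, swa_every=50):
    """EMA of the weights each step; every `swa_every` steps the EMA
    is folded into a running SWA mean, which is returned at the end."""
    it = iter(weight_stream)
    ema = next(it).astype(np.float64).copy()
    swa, n = np.zeros_like(ema), 0
    for step, w in enumerate(it, start=2):
        ema = ema_decay * ema + (1.0 - ema_decay) * w
        if step % swa_every == 0:
            n += 1
            swa += (ema - swa) / n     # running mean of EMA snapshots
    return swa if n else ema

stream = (np.full(4, 2.0) for _ in range(200))   # toy constant weights
w_avg = averaged_weights(stream)
```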
Compression
- lzma (level: 9)
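The artifact compression step maps directly onto Python's standard-library `lzma` module at the record's preset 9 (maximum compression):

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    """Compress the serialized submission artifact with LZMA, preset 9."""
    return lzma.compress(raw, preset=9)

blob = b"layer.0.weight\x00" * 1024     # repetitive stand-in payload
packed = pack_artifact(blob)
```

Serialized weight blobs with repeated structure compress well, which is what makes the ~15.91 MB artifact size budget workable.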
LR Schedule
- warmdown (warmdown_steps: 4000)
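A sketch of a warmdown schedule with the listed 4000-step window: hold the base LR flat, then decay to zero over the final steps. The linear decay shape is an assumption; the record only names "warmdown".

```python
def lr_multiplier(step, total_steps, warmdown_steps=4000):
    """LR multiplier: 1.0 until the warmdown window starts, then a
    linear ramp down to 0.0 at the final step."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```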
Regularization
- LN scale (scale: 1/sqrt(layer+1))
- structured pruning (type: ±1 by reconstruction error)
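The LN scale rule above is a simple per-layer multiplier; a direct transcription of the listed formula, applied to each of the 11 layers:

```python
import math

def ln_output_scale(layer_idx):
    """Depth-dependent LayerNorm scale 1/sqrt(layer + 1), damping the
    contribution of deeper layers as listed in the record."""
    return 1.0 / math.sqrt(layer_idx + 1)

scales = [ln_output_scale(i) for i in range(11)]   # one per layer
```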
Novel Contributions
- AR self-generated calibration data for GPTQ with no val or train data access during quantization
- Full Hessian GPTQ with Cholesky error compensation and column reordering
- BigramHash widened to 3072 × 112
- XSA applied to all 11 layers
- Removal of TTT while still improving over prior SOTA
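The first contribution, self-generated calibration data, can be sketched as follows: the model samples sequences from itself autoregressively, and those sequences serve as the GPTQ calibration batch, so quantization touches neither train nor val data. The `sample_next(prefix) -> token` interface below is a hypothetical stand-in for the model's sampling step.

```python
import numpy as np

def self_generated_calibration(sample_next, n_seqs=8, seq_len=32, bos=0):
    """Build GPTQ calibration sequences by autoregressive sampling
    from the model itself (sample_next is a stand-in interface)."""
    batch = []
    for _ in range(n_seqs):
        seq = [bos]
        while len(seq) < seq_len:
            seq.append(sample_next(seq))
        batch.append(seq)
    return np.array(batch)

# Toy stand-in model: uniform random next token.
rng = np.random.default_rng(3)
calib = self_generated_calibration(lambda prefix: int(rng.integers(1, 256)))
```

Because the calibration distribution is the model's own output distribution, it approximates the activations seen at inference time without any dataset access.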