PR #728
openRecord: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112
by abaybektursun on GitHub
val_bpb: 1.1142
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.86 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
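As a rough sketch of the int6 grid only (round-to-nearest; GPTQ's Hessian-driven error compensation is not shown, and the symmetric per-row scaling here is an assumption):

```python
import numpy as np

def quantize_int6(w):
    # symmetric 6-bit codes in [-31, 31]; one scale per output row (an assumed layout)
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```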
Architecture
XSA
XSA attention applied to all layers

parameters: {"layers":11}
BigramHash
Wider bigram hash embedding table, used to improve quality while staying under the artifact size budget
parameters: {"vocab_size":3072,"dimension":112}
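A minimal sketch of the lookup: each (previous, current) token pair is hashed into the 3072-row, 112-wide table. The multiplicative hash constants and the position-0 padding choice below are illustrative, not from the PR:

```python
import numpy as np

TABLE, DIM = 3072, 112  # table size and embedding width from this record

rng = np.random.default_rng(0)
bigram_table = rng.normal(0.0, 0.02, (TABLE, DIM))

def bigram_embed(tokens):
    # hash each (prev, cur) pair to a table row; position 0 pairs with itself here
    prev = np.concatenate([tokens[:1], tokens[:-1]])
    idx = (prev * 1000003 + tokens * 10007) % TABLE
    return bigram_table[idx]
```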
MLP3x
Three-times-widened MLP with a squared-LeakyReLU activation
parameters: {"layers":11}
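A sketch of the widened block, reading "LeakyReLU squared" as squaring the LeakyReLU output (the PR may handle the sign of the negative branch differently):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def mlp3x(x, w_in, w_out):
    # hidden width is 3x d_model; activation is LeakyReLU squared
    h = leaky_relu(x @ w_in)
    return (h * h) @ w_out
```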
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dimensions":16,"base_dimensions":64}
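A sketch of rotating only 16 of the 64 head dimensions and passing the rest through; pairing dim i with dim i + 8 is one common convention, assumed here:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    # x: (T, head_dim); rotate only the first rot_dims dims (16 of 64 per this record)
    half = rot_dims // 2
    freqs = 1.0 / base ** (np.arange(half) / half)
    ang = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:rot_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[..., rot_dims:]], axis=-1)
```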
SmearGate
Position-mixing gate
parameters: null
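One plausible reading of "position-mixing gate" is a smear of each position toward its predecessor; the gate would be learned in practice, and the PR does not spell out the exact form:

```python
import numpy as np

def smear_gate(x, gate):
    # y_t = x_t + gate * x_{t-1}; position 0 has no predecessor (zero-padded)
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)
    return x + gate * prev
```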
U-Net skips
Encoder-decoder skip connections
parameters: null
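A skeleton of the skip pattern, assuming the first half of the stack saves activations that the mirrored second half consumes in reverse order (the mixing here is plain addition; a learned skip weight is another common choice):

```python
def unet_forward(x, layers):
    # encoder half records activations; decoder half adds them back, innermost first
    n = len(layers) // 2
    skips = []
    for f in layers[:n]:
        x = f(x)
        skips.append(x)
    for f in layers[n:]:
        x = f(x + skips.pop())
    return x
```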
KV head count
Attention uses 8 query heads with 4 KV heads (GQA)
parameters: {"gqa_heads":8,"kv_heads":4}
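With 8 query heads over 4 KV heads, each KV head serves 2 query heads. A minimal causal-attention sketch of the grouping:

```python
import numpy as np

def gqa_attention(q, k, v):
    # q: (Hq, T, d); k, v: (Hkv, T, d); each KV head serves Hq // Hkv query heads
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    T = q.shape[1]
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), 1), -1e9, scores)  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```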
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997,"swa_every":50}
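A sketch of maintaining both averages with the listed parameters (EMA decay 0.997 every step, SWA snapshot every 50 steps); how the two are combined at eval time is not stated in the record:

```python
class AveragedWeights:
    def __init__(self, w, ema_decay=0.997, swa_every=50):
        self.ema = dict(w)
        self.swa_sum = {k: 0.0 for k in w}
        self.swa_n = 0
        self.decay = ema_decay
        self.every = swa_every

    def update(self, w, step):
        # EMA every step; SWA accumulates a plain average of periodic snapshots
        for k, v in w.items():
            self.ema[k] = self.decay * self.ema[k] + (1 - self.decay) * v
        if step % self.every == 0:
            for k, v in w.items():
                self.swa_sum[k] += v
            self.swa_n += 1

    def swa(self):
        return {k: s / self.swa_n for k, s in self.swa_sum.items()}
```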
Compression
lzma
level: 9
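The artifact is lzma-compressed at the maximum preset; with Python's stdlib that is a one-liner (the payload below is a stand-in, not the real packed weights):

```python
import lzma
import numpy as np

payload = np.zeros(4096, dtype=np.int8)  # stand-in for the packed int6 weight codes
blob = lzma.compress(payload.tobytes(), preset=9)
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
```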
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
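The warmdown schedule holds the base LR constant and decays it linearly to zero over the final 4000 iterations; a sketch:

```python
def lr_at(step, total_steps, base_lr, warmdown_iters=4000):
    # constant base_lr until the last warmdown_iters steps, then linear decay to 0
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```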
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
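The listed rule scales each layer's norm gain by 1/sqrt(layer+1), damping deeper layers' contributions:

```python
import math

def ln_gain(layer_idx):
    # per-layer norm gain: layer 0 -> 1.0, layer 3 -> 0.5, ...
    return 1.0 / math.sqrt(layer_idx + 1)
```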
Evaluation
sliding window eval
parameters: {"stride":64}
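Sliding-window evaluation scores every token exactly once while giving later tokens fresh left context: the first window scores all its tokens, and each later window slides by the stride and scores only its last `stride` tokens. A sketch of the span bookkeeping (the window length is not given in the record):

```python
def eval_spans(n_tokens, window, stride=64):
    # (context_start, score_start, score_end) triples covering every token once
    spans = [(0, 0, min(window, n_tokens))]
    start = min(window, n_tokens)
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        spans.append((end - window, start, end))
        start = end
    return spans
```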
Other
other
Validation-data GPTQ calibration using forward-only Hessian collection on validation tokens instead of training tokens
parameters: {"calib_batches":64}
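GPTQ's second-order statistic for a linear layer is H = 2·XᵀX over calibration activations, which needs forward passes only; here X comes from validation batches rather than training data. A sketch of the accumulation (the per-token normalization is a convention choice):

```python
import numpy as np

def collect_hessian(batches):
    # accumulate H = 2 * X^T X over forward-only calibration batches;
    # each X holds the activations feeding the weight matrix being quantized
    d = batches[0].shape[1]
    H = np.zeros((d, d))
    n = 0
    for X in batches:
        H += 2.0 * X.T @ X
        n += X.shape[0]
    return H / n
```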
other
Selective ±1 pruning by reconstruction error
parameters: null
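One reading of "selective ±1 pruning" is zeroing quantized codes of ±1 whose contribution to the layer output is negligible, which also yields more zeros for lzma to compress; the PR does not spell out the procedure, so this is a guess:

```python
import numpy as np

def prune_pm1(q, scale, X, tol=0.02):
    # zero out +/-1 int codes whose column contributes less than `tol` of the
    # output energy ||X @ (q * scale)||^2 (greedy; no re-check after each zero)
    q = q.copy()
    base = X @ (q * scale)
    budget = tol * np.sum(base ** 2)
    for i in np.flatnonzero(np.abs(q) == 1):
        delta = X[:, i] * (q[i] * scale)
        if np.sum(delta ** 2) <= budget:
            q[i] = 0
    return q
```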
other
Parallel Muon optimizer with parameter banking and overlapped communication
parameters: {"parameter_banks":4}
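The core of Muon is approximately orthogonalizing each gradient matrix with a quintic Newton-Schulz iteration (coefficients below are from the public Muon implementation); the parameter banking and overlapped communication are distributed-training details not sketched here:

```python
import numpy as np

def newton_schulz(G, steps=5):
    # push the singular values of G toward 1 without computing an SVD
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```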
Novel Contributions
- Validation-data GPTQ calibration to avoid eval-time training-data access
- BigramHash widened to 3072 × 112
- Full Hessian GPTQ int6 quantization with val calibration
- XSA-all stack combined with selective pruning and artifact-budget tuning
- Parallel Muon optimizer context enabling ~6.95k steps in 600s