PR #1473 (open)
Non-record: 11L FullGPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11564 (1-seed)
by AVINASH0052
val_bpb: 1.1156
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,832,508 bytes
Training Techniques
Architecture
XSA
XSA applied to all layers
parameters: {"layers":11}
BigramHash
Bigram hash embedding
parameters: {"dimensions":[3072,112]}
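The exact hashing scheme is not given in the submission; a minimal sketch, assuming 3072 is the number of hash buckets and 112 the embedding width, with an illustrative (prev, cur) token-pair mixing hash:

```python
import numpy as np

N_BUCKETS, DIM = 3072, 112  # from the reported 3072×112 shape

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((N_BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Illustrative mixing hash; the submission's actual hash is not specified.
    return ((prev_tok * 1000003) ^ tok) % N_BUCKETS

def bigram_embed(token_ids):
    # Each position t looks up the embedding for the (t-1, t) token pair;
    # position 0 is paired with a padding id of 0 (an assumption).
    out = np.empty((len(token_ids), DIM), dtype=np.float32)
    prev = 0
    for t, tok in enumerate(token_ids):
        out[t] = bigram_table[bigram_bucket(prev, tok)]
        prev = tok
    return out

emb = bigram_embed([5, 17, 17, 2])
print(emb.shape)  # (4, 112)
```

In practice this would be added to (or concatenated with) the unigram token embedding before the first layer.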
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"squared":true}
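A plausible reading, by analogy with the ReLU² activation common in speedrun MLPs, is elementwise squaring of the LeakyReLU output; the submission's exact formulation is not specified:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # The negative slope value is an assumption; it is not reported.
    return np.where(x >= 0, x, slope * x)

def leaky_relu_squared(x, slope=0.01):
    # Plain squaring of the LeakyReLU output; note this maps negative
    # inputs to small positive values.
    y = leaky_relu(x, slope)
    return y * y

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu_squared(x))  # [4.e-04 0.e+00 9.e+00]
```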
MLP3x
3x width MLP
parameters: {"width_multiplier":3}
GQA
Grouped query attention
parameters: {"heads":8,"kv_heads":4}
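With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads, halving the KV cache. A minimal numpy sketch (scaling and causal masking are standard assumptions, not taken from the submission):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q is (T, n_heads, d); k, v are (T, n_kv_heads, d)."""
    group = n_heads // n_kv_heads          # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)        # broadcast KV heads to (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    T, _, d = q.shape
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum("hts,shd->thd", w, v)

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa_attention(rng.standard_normal((T, 8, d)),
                    rng.standard_normal((T, 4, d)),
                    rng.standard_normal((T, 4, d)))
print(out.shape)  # (5, 8, 16)
```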
Partial RoPE
RoPE applied to a subset of dimensions
parameters: {"dimensions":16,"total_dimensions":64}
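Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the remaining 48 through unchanged. A sketch assuming the split-half pairing convention (interleaved pairing is the other common choice, and which one the submission uses is not stated):

```python
import numpy as np

def partial_rope(x, rotate_dims=16, base=10000.0):
    """Apply RoPE to the first `rotate_dims` of each head vector; x is (T, d)."""
    T, d = x.shape
    half = rotate_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq[None, :]   # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rotate_dims]      # paired halves
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rotate_dims:]], axis=-1)

x = np.ones((3, 64))
y = partial_rope(x)
print(y.shape)                            # (3, 64)
print(np.allclose(y[:, 16:], x[:, 16:]))  # True: unrotated dims pass through
```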
U-Net skip connections
Skip connections linking early and late layers in a U-Net style
parameters: {"pairs":[[0,10],[1,9],[2,8]]}
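The reported pairs connect early-layer outputs to late-layer inputs (0→10, 1→9, 2→8). A toy forward pass; whether the skip carries the layer's input or output, and whether it is gated or weighted, is not specified, so plain addition of the early layer's output is assumed:

```python
import numpy as np

SKIP_PAIRS = {10: 0, 9: 1, 8: 2}   # destination layer -> source layer

def forward(x, layers):
    saved = {}
    for i, layer in enumerate(layers):
        if i in SKIP_PAIRS:
            x = x + saved[SKIP_PAIRS[i]]   # U-Net style long skip
        x = layer(x)
        if i in SKIP_PAIRS.values():
            saved[i] = x                   # stash early-layer output
    return x

# Toy stand-in layers: each just scales its input.
layers = [lambda x, s=s: s * x for s in np.linspace(1.0, 1.1, 11)]
out = forward(np.ones(4), layers)
print(out.shape)  # (4,)
```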
VE128
Value embeddings on later layers
parameters: {"layers":[9,10]}
SmearGate
Input smearing gate on embeddings
parameters: null
weight scaling
Shared weight scales across layers
parameters: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(L+1)"}
weight decay
parameters: {"value":0.04}
Weight Averaging
EMA
parameters: {"decay":0.997}
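With the reported decay of 0.997, the EMA shadow weights move 0.3% of the way toward the current weights after each optimizer step:

```python
# Minimal EMA of model weights: after each optimizer step,
# shadow <- 0.997 * shadow + 0.003 * current.
def ema_update(shadow, params, decay=0.997):
    for name, p in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * p
    return shadow

params = {"w": 1.0}
shadow = dict(params)
for step in range(3):
    params["w"] += 0.1          # pretend the optimizer moved the weight
    shadow = ema_update(shadow, params)
print(round(shadow["w"], 6))    # lags well behind params["w"] = 1.3
```

Evaluation then uses the shadow weights rather than the raw ones.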
Tight SWA
parameters: {"start_step":6150}
Quantization
late QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
int6
bits: 6
scope: all
Compression
lzma
level: 9
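LZMA at preset 9 is the final artifact-size lever, applied to the serialized (quantized) weights. With Python's standard library:

```python
import lzma

# Stand-in bytes; in the real pipeline this would be the packed int6 weights.
payload = b"model weights would go here" * 100
packed = lzma.compress(payload, preset=9)   # preset 9 = max compression
print(len(packed) < len(payload))           # True
restored = lzma.decompress(packed)
print(restored == payload)                  # True: lossless round-trip
```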
Evaluation
sliding window eval
parameters: {"stride":64}
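A stride-64 sliding window evaluation advances the context window 64 tokens at a time and scores only the newly exposed tokens, so each token is evaluated with close to the longest available left context. A hypothetical index generator, assuming the 1024-token training context is also used at eval time (eval_length is reported as null):

```python
# Each span is (window_start, window_end, score_from): the model sees tokens
# [start, end) and only tokens [score_from, end) contribute to the loss.
def sliding_windows(n_tokens, ctx=1024, stride=64):
    spans = []
    for end in range(min(ctx, n_tokens), n_tokens + 1, stride):
        start = max(0, end - ctx)
        score_from = end - stride if spans else start  # first window scores everything
        spans.append((start, end, score_from))
    return spans

spans = sliding_windows(2048)
print(spans[0])   # (0, 1024, 0)
print(spans[1])   # (64, 1088, 1024)
print(spans[-1])  # (1024, 2048, 1984)
```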
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"multi_gpu":true}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
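A warmdown schedule holds the base learning rate and then decays it linearly to zero over the final 3,500 steps. A sketch; the total step count and base LR below are placeholders, since neither is reported:

```python
# Hold base_lr, then decay linearly to zero over the last `warmdown_steps`.
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps

total = 8000                            # assumed total; not reported
print(lr_at(0, total, 0.02))            # 0.02 (full LR before warmdown)
print(lr_at(total - 1750, total, 0.02)) # 0.01 (halfway through warmdown)
print(lr_at(total, total, 0.02))        # 0.0
```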
Other
other
Full Hessian GPTQ calibration using autoregressive self-generated sequences
parameters: {"calibration_seqs":64,"calibration_tokens":2048}
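GPTQ quantizes each linear layer column-by-column using second-order information from a Hessian proxy H ∝ Σᵢ xᵢxᵢᵀ accumulated over calibration activations; here the calibration data is generated autoregressively by the model itself rather than drawn from a dataset. A sketch of just the Hessian accumulation step, with random vectors standing in for real layer inputs:

```python
import numpy as np

def accumulate_hessian(activations):
    # H = (2/n) * sum_i x_i x_i^T for a linear layer with input dim d;
    # activations is (n, d). The 2/n scaling follows the GPTQ formulation.
    n, d = activations.shape
    return 2.0 * activations.T @ activations / n

rng = np.random.default_rng(0)
# Stand-ins for activations from 64 self-generated sequences (shapes are
# illustrative; the real run used 2048 calibration tokens per sequence).
calib = rng.standard_normal((64 * 32, 128))
H = accumulate_hessian(calib)
print(H.shape)              # (128, 128)
print(np.allclose(H, H.T))  # True: symmetric by construction
```

The "full Hessian" wording suggests the complete d×d matrix is kept (and inverted via Cholesky) rather than a diagonal or blocked approximation.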
Novel Contributions
- XSA applied to all 11 layers
- BigramHash 3072×112 embedding
- Full Hessian GPTQ int6 with autoregressive self-generated calibration
- Late QAT with int6 quantization
- U-Net style skip connections
- Partial RoPE and VE128 on later layers
- Sliding window evaluation with stride 64