PR #609 (open)
Non-record: 11L XSA-all + Full GPTQ + Selective Pruning (val_bpb=1.1154, 3-seed)
by saml212
val_bpb: 1.1154
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.94 MB
Training Techniques
Architecture
XSA
Cross-Position Self-Attention applied to all 11 layers instead of only the last 4, forcing cross-position information mixing from layer 0
parameters: {"layers":11}
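The layer selection above can be sketched as a config change. The variable names and the baseline's layer indices are illustrative assumptions, not the PR's code:

```python
# Which layers get Cross-Position Self-Attention (XSA).
n_layers = 11

# Baseline (assumed): XSA only on the last 4 layers.
baseline_xsa_layers = list(range(n_layers - 4, n_layers))  # layers 7..10

# This PR: XSA on every layer, so cross-position mixing starts at layer 0.
xsa_layers = list(range(n_layers))                         # layers 0..10
```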
Selective ±1 magnitude pruning
Post-GPTQ pruning of ±1 quantized values sorted by reconstruction error (scale²), zeroing the least-impactful values first until the artifact fits
parameters: null
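The pruning pass can be sketched as follows. A minimal NumPy sketch under assumptions: weights are int-quantized with per-row scales, and zeroing a ±1 entry incurs squared reconstruction error scale²; the function name and fixed-count interface are illustrative (the PR prunes until a size budget is met):

```python
import numpy as np

def selective_prune_pm1(q, scale, n_zero):
    """Zero the n_zero quantized +/-1 entries with the smallest error.

    q     : int array of quantized weights
    scale : per-row dequantization scales, shape (rows, 1)

    Zeroing a +/-1 entry changes the dequantized weight by exactly its
    row's scale, so the squared reconstruction error is scale**2; the
    cheapest entries are zeroed first.
    """
    q = q.copy()
    rows, cols = np.nonzero(np.abs(q) == 1)
    errors = scale[rows, 0] ** 2
    cheapest = np.argsort(errors)[:n_zero]
    q[rows[cheapest], cols[cheapest]] = 0
    return q
```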
LeakyReLU(0.5)² MLP 3x
MLP with squared LeakyReLU(0.5) activation, repeated 3 times
parameters: null
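The activation can be written elementwise; a NumPy sketch (note the square makes the output non-negative):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU with negative slope 0.5, then an elementwise square.
    y = np.where(x > 0, x, slope * x)
    return y * y
```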
BigramHash
Bigram hashing with 2048 buckets
parameters: {"buckets":2048}
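A bigram hash of this kind maps each pair of adjacent token ids into one of the 2048 buckets. The multiplier and mixing scheme below are assumptions for illustration, not the PR's actual hash:

```python
def bigram_bucket(prev_token, token, n_buckets=2048):
    # Hash a (previous token, current token) pair into a bucket index;
    # 1000003 is an arbitrary large prime chosen here for mixing.
    return (prev_token * 1000003 + token) % n_buckets
```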
Partial RoPE
Rotary Positional Embeddings applied to 16 of the 64 head dimensions
parameters: {"partial_rope":"16/64"}
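Partial RoPE rotates only a prefix of each head's dimensions and passes the rest through unchanged. A minimal NumPy sketch, assuming the rotated dimensions are the first 16 of 64 and the standard base of 10000 (both assumptions; the PR records only "16/64"):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply RoPE to the first rot_dims of each position's vector.

    x : array of shape (seq_len, head_dim); dims rot_dims.. pass through.
    """
    seq_len, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq_len)[:, None] * inv_freq      # (seq_len, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```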
LN Scale
LayerNorm scaling
parameters: null
VE128
Value Embedding with dimension 128
parameters: {"dimension":128}
SmearGate
SmearGate mechanism
parameters: null
U-Net skips
Skip connections inspired by U-Net architecture
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
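The EMA weight average updates a shadow copy of the parameters each step. A sketch using the decay of 0.997 from the parameters above, with a list of floats standing in for real parameter tensors:

```python
def ema_update(avg_params, new_params, decay=0.997):
    # Exponential moving average: shadow <- decay * shadow + (1 - decay) * new.
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(avg_params, new_params)]
```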
Tight SWA
parameters: null
Quantization
Full Hessian GPTQ
bits: 6
scope: int6
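Full-Hessian GPTQ quantizes columns in sequence while compensating the remaining weights for the rounding error; that loop is omitted here. The sketch below shows only the symmetric int6 grid being quantized onto, with per-row scales and a [-31, 31] level range (an assumption about the grid layout):

```python
import numpy as np

def int6_grid(w):
    # Symmetric per-row int6 quantization grid (round-to-nearest shown;
    # GPTQ replaces plain rounding with Hessian-aware error compensation).
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale
```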
Compression
lzma
level: null
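The final artifact is LZMA-compressed; since no level is recorded, the preset below is an assumption:

```python
import lzma

def compress_artifact(blob: bytes) -> bytes:
    # LZMA-compress the serialized artifact; preset 9 (maximum) is an
    # assumption, as the PR does not record a compression level.
    return lzma.compress(blob, preset=9)
```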
Novel Contributions
- Applying Cross-Position Self-Attention (XSA) to all 11 layers instead of the standard last 4, improving cross-position information mixing from layer 0
- Selective ±1 magnitude pruning after GPTQ: ±1 quantized values are sorted by reconstruction error (scale²) and the least impactful are zeroed first until the artifact fits