PR #587

open

XSA-11 + GPTQ b64/pd002 — 3-seed mean val_bpb 1.1208

by newjordan
val_bpb: 1.1208
Architecture: Transformer
Optimizer:
Artifact Size: 15.56 MB

Training Techniques

Architecture
XSA
Expanded cross-self attention applied to all 11 layers
parameters: {"layers":11}
BigramHash
Bigram hashing with 2048 buckets
parameters: {"buckets":2048}
relu²
Using squared ReLU activation function
parameters: null
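The squared ReLU activation is simply ReLU followed by squaring:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    # Squared ReLU: max(x, 0)**2 — zero (with zero slope) for negatives.
    return np.maximum(x, 0.0) ** 2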
VE
Value embedding dimension
parameters: {"VE":128}
tied embeddings
Weight tying of embeddings
parameters: null
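Weight tying reuses the input embedding matrix as the output head, saving a vocab × dim parameter block. A sketch with illustrative sizes (the PR does not state vocab or model dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64                 # illustrative sizes (assumption)
W_embed = rng.normal(size=(vocab, dim)) * 0.02

def embed(token_ids: np.ndarray) -> np.ndarray:
    # Input embedding: row lookup.
    return W_embed[token_ids]

def logits(hidden: np.ndarray) -> np.ndarray:
    # Output head reuses the SAME matrix, transposed — no separate unembedding.
    return hidden @ W_embed.T
```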
Quantization
GPTQ
bits: 6
scope: all
Weight Averaging
EMA
parameters: {"decay":0.997}
Compression
zstd
level: 22
Test-Time Training
full TTT
parameters: null
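A toy sketch of the test-time-training idea: take a few gradient steps on the test sequence's own prediction loss before emitting predictions. A linear least-squares model stands in for the real language-model objective here (an assumption for illustration):

```python
import numpy as np

def ttt_adapt(W: np.ndarray, xs: np.ndarray, ys: np.ndarray,
              lr: float = 0.01, steps: int = 5) -> np.ndarray:
    # "Full TTT" sketch: briefly fine-tune on the test inputs themselves,
    # then use the adapted weights W for the actual predictions.
    W = W.copy()
    for _ in range(steps):
        pred = xs @ W
        grad = xs.T @ (pred - ys) / len(xs)   # mean-squared-error gradient
        W -= lr * grad
    return W
```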

Novel Contributions

  • Applying cross-self attention (XSA) on all 11 layers instead of 4, reducing val_bpb by 0.0006
  • Using GPTQ quantization with block_size=64 and percdamp=0.002 for better compression and less Hessian damping
  • Combining expanded XSA with finer GPTQ quantization to free space for larger architecture modifications
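Two of the GPTQ knobs mentioned above can be sketched directly. Hessian damping adds percdamp times the mean diagonal to H before inversion (0.002 here versus the common 0.01 default), and columns are quantized in blocks of 64; everything else about GPTQ is omitted:

```python
import numpy as np

def dampen_hessian(H: np.ndarray, percdamp: float = 0.002) -> np.ndarray:
    # GPTQ-style damping: add percdamp * mean(diag(H)) to the diagonal
    # to stabilize the inverse. percdamp=0.002 per this PR.
    damp = percdamp * np.mean(np.diag(H))
    return H + damp * np.eye(H.shape[0])

def column_blocks(n_cols: int, block_size: int = 64) -> list[tuple[int, int]]:
    # GPTQ quantizes weight columns block by block; block_size=64 per this PR.
    return [(i, min(i + block_size, n_cols)) for i in range(0, n_cols, block_size)]
```

Smaller percdamp perturbs the Hessian less (closer to the true curvature); smaller blocks update the remaining columns' error compensation more often.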