| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.1208 | Transformer | — | 15.56 MB |
Training Techniques
Architecture
- XSA: expanded cross-self attention applied to all 11 layers (parameters: {"layers": 11})
- BigramHash: bigram hashing with 2048 buckets (parameters: {"buckets": 2048})
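A minimal sketch of what bigram hashing might look like. The bucket count matches the entry; the specific hash function and mixing constants below are hypothetical stand-ins, since the entry does not specify them. Each consecutive token pair is hashed into one of 2048 buckets, which can index an auxiliary embedding table.

```python
# Sketch of bigram hashing into 2048 buckets. The multiplicative constants
# are an assumed, illustrative choice of pair hash, not the entry's actual one.
N_BUCKETS = 2048

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Mix the token pair, then fold the result into the bucket range.
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h >> 7) % N_BUCKETS

tokens = [17, 42, 42, 9]
# One bucket id per consecutive token pair in the sequence.
buckets = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
print(buckets)  # three bucket ids, each in [0, 2048)
```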
- relu²: squared ReLU activation function
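Squared ReLU is a one-line change to the activation function: negative inputs still map to zero, and positive inputs are squared rather than passed through.

```python
# Squared ReLU: relu2(x) = max(x, 0) ** 2
def relu_squared(x: float) -> float:
    return max(x, 0.0) ** 2

print([relu_squared(v) for v in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 9.0]
```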
- VE: value embeddings of dimension 128 (parameters: {"VE": 128})
- tied embeddings: weight tying between the input and output embedding matrices
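Weight tying means the same embedding matrix serves both directions, so the artifact stores only one matrix. A toy sketch with hypothetical sizes and values:

```python
# Weight tying sketch: the shared table E maps a token id to a vector on the
# way in, and produces logits by dotting the hidden state with every row of
# the same E on the way out. Sizes and values here are arbitrary toys.
vocab, dim = 4, 3
E = [[0.1 * (i + j) for j in range(dim)] for i in range(vocab)]  # shared table

def embed(tok: int):
    # Input side: plain row lookup.
    return E[tok]

def logits(hidden):
    # Output side: dot product with each row of the same matrix E.
    return [sum(e * h for e, h in zip(row, hidden)) for row in E]

h = embed(2)
print(len(logits(h)))  # 4 — one logit per vocab entry, no second matrix
```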
Quantization
- GPTQ: 6 bits, applied to all weights
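To make the 6-bit storage format concrete, here is plain round-to-nearest quantization over blocks of 64 weights (the block_size named in the contributions below). This is only an illustration of the bit layout: GPTQ itself is more involved, quantizing weights one column at a time and using a damped Hessian to redistribute rounding error onto not-yet-quantized weights.

```python
# Round-to-nearest symmetric 6-bit quantization per block of 64 weights.
# Illustrative only; GPTQ proper adds Hessian-based error compensation.
BITS, BLOCK = 6, 64
QMAX = 2 ** (BITS - 1) - 1  # 31 for symmetric 6-bit integers

def quantize_block(w):
    # One scale per block, sized so the largest weight maps to +/-QMAX.
    scale = max(abs(v) for v in w) / QMAX or 1.0
    q = [max(-QMAX - 1, min(QMAX, round(v / scale))) for v in w]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

weights = [0.01 * i - 0.3 for i in range(BLOCK)]  # toy weight block
q, s = quantize_block(weights)
restored = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(err <= s)  # rounding error stays within one quantization step
```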
Weight Averaging
- EMA: exponential moving average of the weights (parameters: {"decay": 0.997})
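The EMA update is a one-liner applied after each optimizer step, with decay 0.997 as listed; the shadow copy, not the raw weights, is what gets shipped. A minimal sketch over a toy weight vector:

```python
# EMA weight averaging: ema = decay * ema + (1 - decay) * w after each step.
DECAY = 0.997

def ema_update(ema, w, decay=DECAY):
    return [decay * e + (1 - decay) * v for e, v in zip(ema, w)]

ema = [0.0, 0.0]
for step in range(3):
    current = [1.0, 2.0]  # stand-in for the post-step model weights
    ema = ema_update(ema, current)
print(ema)  # slowly drifting toward [1.0, 2.0]
```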
Compression
- zstd: level 22
Test-Time Training
- full TTT
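The entry gives no detail beyond "full TTT", so the following is a hedged sketch of the general idea: while scoring the evaluation stream, keep taking gradient steps on the tokens already observed, so later text is predicted by an adapted model. The "model" here is just a learned logit vector over a toy vocabulary, not the actual Transformer.

```python
# Test-time training sketch: score each token, then take one SGD step on its
# cross-entropy loss before moving to the next token. Toy model and data.
import math

VOCAB, LR = 4, 0.5
logits = [0.0] * VOCAB  # stand-in for the model's trainable parameters

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

stream = [1, 1, 2, 1, 1, 1]  # evaluation tokens, dominated by token 1
nll = 0.0
for tok in stream:
    p = softmax(logits)
    nll += -math.log(p[tok])              # score the token first ...
    for i in range(VOCAB):                # ... then one SGD step on its loss
        grad = p[i] - (1.0 if i == tok else 0.0)
        logits[i] -= LR * grad

# The adapted model now favors the token that dominated the stream.
print(logits[1] == max(logits))  # True
```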
Novel Contributions
- Applying cross-self attention (XSA) to all 11 layers instead of 4, reducing BPB by 0.0006
- Using GPTQ quantization with block_size=64 and percdamp=0.002, giving finer-grained quantization with lighter Hessian damping
- Combining expanded XSA with finer GPTQ quantization to free space for larger architecture modifications