| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.1208 | Transformer | — | 15.56 MB |
Training Techniques
Architecture
- XSA: expanded cross-self attention applied to all 11 layers (parameters: {"layers": 11})
- BigramHash: bigram hashing with 2048 buckets (parameters: {"buckets": 2048})
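A minimal sketch of what bigram hashing might look like. The bucket count matches the entry; the specific hash function and mixing constants below are hypothetical stand-ins, since the entry does not specify them. Each consecutive token pair is hashed into one of 2048 buckets, which can index an auxiliary embedding table.

```python
# Sketch of bigram hashing into 2048 buckets. The multiplicative constants
# are an assumed, illustrative choice of pair hash, not the entry's actual one.
N_BUCKETS = 2048

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Mix the token pair, then fold the result into the bucket range.
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h >> 7) % N_BUCKETS

tokens = [17, 42, 42, 9]
# One bucket id per consecutive token pair in the sequence.
buckets = [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
print(buckets)  # three bucket ids, each in [0, 2048)
```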
- relu²: squared ReLU activation function
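Squared ReLU is a one-line change to the activation function: negative inputs still map to zero, and positive inputs are squared rather than passed through.

```python
# Squared ReLU: relu2(x) = max(x, 0) ** 2
def relu_squared(x: float) -> float:
    return max(x, 0.0) ** 2

print([relu_squared(v) for v in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 9.0]
```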
- VE: value embeddings of dimension 128 (parameters: {"VE": 128})
- tied embeddings: weight tying between the input and output embedding matrices
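Weight tying means the same embedding matrix serves both directions, so the artifact stores only one matrix. A toy sketch with hypothetical sizes and values:

```python
# Weight tying sketch: the shared table E maps a token id to a vector on the
# way in, and produces logits by dotting the hidden state with every row of
# the same E on the way out. Sizes and values here are arbitrary toys.
vocab, dim = 4, 3
E = [[0.1 * (i + j) for j in range(dim)] for i in range(vocab)]  # shared table

def embed(tok: int):
    # Input side: plain row lookup.
    return E[tok]

def logits(hidden):
    # Output side: dot product with each row of the same matrix E.
    return [sum(e * h for e, h in zip(row, hidden)) for row in E]

h = embed(2)
print(len(logits(h)))  # 4 — one logit per vocab entry, no second matrix
```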
Quantization
- GPTQ: 6 bits, applied to all weights
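To make the 6-bit storage format concrete, here is plain round-to-nearest quantization over blocks of 64 weights (the block_size named in the contributions below). This is only an illustration of the bit layout: GPTQ itself is more involved, quantizing weights one column at a time and using a damped Hessian to redistribute rounding error onto not-yet-quantized weights.

```python
# Round-to-nearest symmetric 6-bit quantization per block of 64 weights.
# Illustrative only; GPTQ proper adds Hessian-based error compensation.
BITS, BLOCK = 6, 64
QMAX = 2 ** (BITS - 1) - 1  # 31 for symmetric 6-bit integers

def quantize_block(w):
    # One scale per block, sized so the largest weight maps to +/-QMAX.
    scale = max(abs(v) for v in w) / QMAX or 1.0
    q = [max(-QMAX - 1, min(QMAX, round(v / scale))) for v in w]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

weights = [0.01 * i - 0.3 for i in range(BLOCK)]  # toy weight block
q, s = quantize_block(weights)
restored = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(err <= s)  # rounding error stays within one quantization step
```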
Weight Averaging
- EMA: exponential moving average of the weights (parameters: {"decay": 0.997})
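The EMA update is a one-liner applied after each optimizer step, with decay 0.997 as listed; the shadow copy, not the raw weights, is what gets shipped. A minimal sketch over a toy weight vector:

```python
# EMA weight averaging: ema = decay * ema + (1 - decay) * w after each step.
DECAY = 0.997

def ema_update(ema, w, decay=DECAY):
    return [decay * e + (1 - decay) * v for e, v in zip(ema, w)]

ema = [0.0, 0.0]
for step in range(3):
    current = [1.0, 2.0]  # stand-in for the post-step model weights
    ema = ema_update(ema, current)
print(ema)  # slowly drifting toward [1.0, 2.0]
```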
Compression
- zstd: level 22
Test-Time Training
- full TTT
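The entry gives no detail beyond "full TTT", so the following is a hedged sketch of the general idea: while scoring the evaluation stream, keep taking gradient steps on the tokens already observed, so later text is predicted by an adapted model. The "model" here is just a learned logit vector over a toy vocabulary, not the actual Transformer.

```python
# Test-time training sketch: score each token, then take one SGD step on its
# cross-entropy loss before moving to the next token. Toy model and data.
import math

VOCAB, LR = 4, 0.5
logits = [0.0] * VOCAB  # stand-in for the model's trainable parameters

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

stream = [1, 1, 2, 1, 1, 1]  # evaluation tokens, dominated by token 1
nll = 0.0
for tok in stream:
    p = softmax(logits)
    nll += -math.log(p[tok])              # score the token first ...
    for i in range(VOCAB):                # ... then one SGD step on its loss
        grad = p[i] - (1.0 if i == tok else 0.0)
        logits[i] -= LR * grad

# The adapted model now favors the token that dominated the stream.
print(logits[1] == max(logits))  # True
```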
Novel Contributions
- Applying cross-self attention (XSA) to all 11 layers instead of 4, reducing BPB by 0.0006
- Using GPTQ quantization with block_size=64 and percdamp=0.002, giving finer-grained quantization with lighter Hessian damping
- Combining expanded XSA with finer GPTQ quantization to free space for larger architecture modifications