val_bpb: 1.1220
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.9 MB
Training Techniques
Architecture
- XSA: applied on all layers (layers: 11)
- BigramHash: bigram hash embedding component (vocab_size: 3072, dim: 112)
- LeakyReLU: LeakyReLU squared activation (slope: 0.5)
- GQA: grouped query attention with 8 attention heads and 4 KV heads (layers: 11, heads: 8, kv_heads: 4)
- RoPE: partial rotary positional embedding (rotary dimensions: 16 of 64)
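As an illustration of the bigram hash embedding (a sketch, not the submission's code): each (previous token, current token) pair is hashed into a fixed table of 3072 rows of width 112, and the looked-up vector augments the usual token embedding. The hashing constants and function names here are illustrative assumptions.

```python
# Hypothetical sketch of a bigram hash embedding (not the submission's code).
HASH_VOCAB = 3072   # number of hash buckets (from the listed parameters)
HASH_DIM = 112      # embedding width of each bucket

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Mix the two token ids and reduce modulo the table size."""
    # Multiplicative hashing constants are arbitrary illustrative choices.
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h & 0xFFFFFFFF) % HASH_VOCAB

def bigram_embed(tokens: list[int], table: list[list[float]]) -> list[list[float]]:
    """Return one HASH_DIM vector per position (zeros for position 0)."""
    out = [[0.0] * HASH_DIM]  # no previous token at position 0
    for i in range(1, len(tokens)):
        out.append(table[bigram_bucket(tokens[i - 1], tokens[i])])
    return out
```

In practice the looked-up vector would be added to (or concatenated with) the standard unigram embedding before the first transformer block.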
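The LeakyReLU-squared activation, read literally, squares the output of a LeakyReLU with negative slope 0.5. Note that plain squaring folds negative inputs to positive outputs; a sign-preserving variant is also plausible, and the exact convention is an assumption here.

```python
# Hedged sketch: "LeakyReLU squared" read as squaring the LeakyReLU output.
def leaky_relu(x: float, slope: float = 0.5) -> float:
    return x if x > 0 else slope * x

def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # Plain square; some implementations instead keep the sign (y * |y|).
    y = leaky_relu(x, slope)
    return y * y
```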
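The GQA configuration can be summarized by its head-sharing map alone (no attention math shown): with 8 query heads and 4 KV heads, each KV head serves 2 consecutive query heads, halving the KV cache. The function name below is illustrative.

```python
# Sketch of the grouped-query-attention head mapping only.
HEADS = 8
KV_HEADS = 4

def kv_head_for(query_head: int) -> int:
    """Index of the shared K/V head used by a given query head."""
    group = HEADS // KV_HEADS  # query heads per KV head (= 2 here)
    return query_head // group
```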
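Partial RoPE rotates only the first 16 of the 64 head dimensions; the rest pass through unrotated. A minimal sketch, assuming the standard RoPE base frequency of 10000 (not stated in the card):

```python
import math

# Sketch of partial RoPE: only the first ROT dims of each 64-dim head are
# rotated by position-dependent angles; the remaining dims are unchanged.
ROT, HEAD_DIM = 16, 64

def partial_rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    out = list(vec)
    for i in range(0, ROT, 2):  # rotate dims in pairs (i, i+1)
        theta = pos * base ** (-i / ROT)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out
```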
Quantization
- GPTQ (bits: 6, scope: all)
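For intuition, the 6-bit grid alone can be sketched as symmetric round-to-nearest quantization. GPTQ proper is different: it chooses the rounding using second-order (Hessian) information and compensates each column's error into not-yet-quantized columns; the sketch below shows only the grid, not the GPTQ update.

```python
# Illustrative sketch of the 6-bit weight grid only (not GPTQ's algorithm).
def quantize_6bit(weights: list[float]) -> tuple[list[int], float]:
    """Map floats onto a symmetric 6-bit integer grid [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1  # 31
    # Per-tensor scale; the `or 1.0` guards against an all-zero input.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes: list[int], scale: float) -> list[float]:
    return [c * scale for c in codes]
```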
Weight Averaging
- EMA (decay: 0.997)
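EMA weight averaging with decay 0.997 keeps a shadow copy of the parameters that moves a fraction (1 - decay) toward the live weights after each optimizer step; the shadow copy is what gets evaluated and shipped. A minimal sketch over scalar parameters:

```python
# Minimal sketch of EMA weight averaging with the listed decay.
DECAY = 0.997

def ema_update(shadow: dict[str, float], params: dict[str, float]) -> None:
    """Move the shadow copy a fraction (1 - DECAY) toward the live weights."""
    for name, value in params.items():
        shadow[name] = DECAY * shadow[name] + (1.0 - DECAY) * value
```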
Compression
- lzma (level: null)
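With the level left null, a natural reading is that the library's default preset applies. A round-trip sketch using Python's stdlib `lzma` module (the card does not state which lzma binding was used):

```python
import lzma

# Sketch of artifact packing: lzma compression with the default preset
# when preset is None (Python stdlib semantics).
def pack(blob: bytes, preset=None) -> bytes:
    return lzma.compress(blob, preset=preset)

def unpack(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```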
Evaluation
- Sliding-window evaluation (stride: 64)
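Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time and scores only the newly exposed tokens, so every token is evaluated exactly once with the longest available left context. A sketch of the window schedule (the window length of 256 is an illustrative assumption, not from the card):

```python
# Sketch of the sliding-window evaluation schedule with stride 64.
def eval_windows(n_tokens: int, window: int = 256, stride: int = 64):
    """Yield (start, end, first_scored) triples covering every token once."""
    pos = 0  # first token not yet scored
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)  # longest available left context
        yield start, end, pos         # score tokens in [pos, end)
        pos = end
```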
Other
- FA3 dtype compatibility wrapper: casts inputs to bf16 when PyTorch does not auto-cast for Flash Attention 3 calls
Novel Contributions
- FA3 dtype compatibility wrapper for PyTorch 2.5.1 Hopper attention
- XSA on all layers
- Full-Hessian GPTQ with autoregressive self-generated calibration
- BigramHash 3072×112
- EMA with decay 0.997
- LeakyReLU squared activation