PR #656 (open)

Three Breadsticks: 1.1190 BPB

by newjordan

val_bpb: 1.1190
Architecture: U-Net
Optimizer:
Artifact Size: 15.81 MB

Training Techniques

Architecture
XSA
Cross Self-Attention applied to last 4 layers
parameters: {"layers":4}
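The PR records only that XSA replaces the baseline attention in the last 4 blocks; the XSA internals themselves are not described. A minimal sketch of the wiring, with placeholder classes standing in for both attention variants (`N_LAYERS` is an assumed total depth, not from the PR):

```python
# Hypothetical sketch: use an XSA module only in the last N transformer
# blocks. Both classes are placeholders; the PR does not define XSA itself.

N_LAYERS = 12    # assumed total depth (not recorded in the PR)
XSA_LAYERS = 4   # from parameters: {"layers": 4}

class SelfAttention:                       # stand-in for baseline attention
    kind = "self"

class CrossSelfAttention(SelfAttention):   # stand-in for XSA
    kind = "xsa"

def build_attention(layer_idx: int) -> SelfAttention:
    """XSA in the last XSA_LAYERS blocks, baseline attention elsewhere."""
    if layer_idx >= N_LAYERS - XSA_LAYERS:
        return CrossSelfAttention()
    return SelfAttention()

blocks = [build_attention(i) for i in range(N_LAYERS)]
```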
Bigram Vocab
Bigram vocabulary size set to 1536
parameters: {"vocab_size":1536}
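One plausible way to build a bigram vocabulary of the recorded size is to keep the most frequent adjacent byte pairs seen in the training data; the PR does not say how its vocabulary is actually constructed, so this is illustrative only:

```python
# Hedged sketch: frequency-based bigram vocabulary capped at 1536 entries.
from collections import Counter

VOCAB_SIZE = 1536  # from parameters: {"vocab_size": 1536}

def build_bigram_vocab(data: bytes, size: int = VOCAB_SIZE):
    counts = Counter(zip(data, data[1:]))   # counts of adjacent byte pairs
    # Keep the most common pairs, capped at the requested vocabulary size.
    return [pair for pair, _ in counts.most_common(size)]

vocab = build_bigram_vocab(b"abababcd" * 100)
```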
Activation
Leaky ReLU squared activation with slope 0.5
parameters: {"slope":0.5}
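A minimal sketch of one plausible reading of "leaky ReLU squared" with slope 0.5: apply the leaky ReLU, then square its output while preserving sign so the function stays monotonic. The exact definition in the PR's code may differ (e.g. squaring without the sign term):

```python
# Assumed definition of leaky_relu_sq; not confirmed by the PR.

SLOPE = 0.5  # from parameters: {"slope": 0.5}

def leaky_relu_sq(x: float, slope: float = SLOPE) -> float:
    y = x if x > 0 else slope * x   # leaky ReLU with negative slope 0.5
    return y * y if y >= 0 else -(y * y)  # square, keeping the sign
```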
Test-Time Training
score-first TTT
parameters: {"freeze_blocks":0,"grad_clip":0.8}
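The recorded parameters suggest every block stays trainable (freeze_blocks=0) and updates are gradient-clipped at 0.8. A loose sketch of a "score-first" step, where the sequence is scored with the unadapted weights before the update; the learning rate and the one-parameter quadratic loss standing in for the model are both illustrative assumptions:

```python
# Loose sketch of a score-first TTT step on a toy one-parameter loss.

GRAD_CLIP = 0.8   # from parameters: {"grad_clip": 0.8}
LR = 0.1          # assumed; not recorded in the PR

def ttt_step(w: float, target: float):
    score = (w - target) ** 2   # score first, with the unadapted weights
    grad = 2 * (w - target)     # then compute the adaptation update
    grad = max(-GRAD_CLIP, min(GRAD_CLIP, grad))  # clip the gradient
    return score, w - LR * grad

score, w = ttt_step(1.0, 0.0)   # scores 1.0, then moves w toward 0
```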
Quantization
GPTQ
bits: 6
scope: null
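For scale, a rough illustration of 6-bit symmetric quantization. This is plain round-to-nearest fake-quant; GPTQ proper additionally compensates rounding error column-by-column using second-order (Hessian) information, which is omitted here:

```python
# Simplified stand-in for the 6-bit setting recorded above (not GPTQ itself).

BITS = 6

def quantize_rtn(weights, bits: int = BITS):
    qmax = 2 ** (bits - 1) - 1              # 31 levels each side for 6-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]  # round-to-nearest integers
    return [v * scale for v in q], scale     # dequantized weights, scale

deq, scale = quantize_rtn([0.9, -0.31, 0.02, -0.62])
```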
Weight Averaging
SWA
parameters: null
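The idea behind SWA is to keep a running average of weight snapshots taken along the training trajectory and evaluate with the averaged weights. The PR records no parameters, so the snapshot schedule here is illustrative:

```python
# Sketch of SWA's core operation: an incremental elementwise mean over
# weight snapshots.

def swa_average(snapshots):
    """Running mean over a list of weight vectors."""
    avg = list(snapshots[0])
    for n, snap in enumerate(snapshots[1:], start=2):
        for i, w in enumerate(snap):
            avg[i] += (w - avg[i]) / n    # incremental mean update
    return avg

avg = swa_average([[1.0, 0.0], [3.0, 2.0]])  # elementwise mean of snapshots
```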
Compression
zstd
level: null

Novel Contributions

  • Leaky-ReLU-squared (leaky_relu_sq) activation with slope 0.5
  • Cross Self-Attention (XSA) applied to the last 4 layers
  • Bigram vocabulary size increased to 1536
  • Legal (rule-compliant) score-first Test-Time Training (TTT) with freeze_blocks=0 and grad_clip=0.8
  • GPTQ int6 quantization combined with zstd compression
  • Stochastic Weight Averaging (SWA)
  • Late Quantization-Aware Training (QAT)