PR #656 (open)

Three Breadsticks: 1.1190 BPB

by newjordan

val_bpb: 1.1190
Architecture: U-Net
Optimizer:
Artifact Size: 15.81 MB

Training Techniques

Architecture
XSA
Cross Self-Attention applied to last 4 layers
parameters: {"layers":4}
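The PR records only that XSA replaces the baseline attention in the last 4 blocks; the XSA internals themselves are not described. A minimal sketch of the wiring, with placeholder classes standing in for both attention variants (`N_LAYERS` is an assumed total depth, not from the PR):

```python
# Hypothetical sketch: use an XSA module only in the last N transformer
# blocks. Both classes are placeholders; the PR does not define XSA itself.

N_LAYERS = 12    # assumed total depth (not recorded in the PR)
XSA_LAYERS = 4   # from parameters: {"layers": 4}

class SelfAttention:                       # stand-in for baseline attention
    kind = "self"

class CrossSelfAttention(SelfAttention):   # stand-in for XSA
    kind = "xsa"

def build_attention(layer_idx: int) -> SelfAttention:
    """XSA in the last XSA_LAYERS blocks, baseline attention elsewhere."""
    if layer_idx >= N_LAYERS - XSA_LAYERS:
        return CrossSelfAttention()
    return SelfAttention()

blocks = [build_attention(i) for i in range(N_LAYERS)]
```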
Bigram Vocab
Bigram vocabulary size set to 1536
parameters: {"vocab_size":1536}
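One plausible way to build a bigram vocabulary of the recorded size is to keep the most frequent adjacent byte pairs seen in the training data; the PR does not say how its vocabulary is actually constructed, so this is illustrative only:

```python
# Hedged sketch: frequency-based bigram vocabulary capped at 1536 entries.
from collections import Counter

VOCAB_SIZE = 1536  # from parameters: {"vocab_size": 1536}

def build_bigram_vocab(data: bytes, size: int = VOCAB_SIZE):
    counts = Counter(zip(data, data[1:]))   # counts of adjacent byte pairs
    # Keep the most common pairs, capped at the requested vocabulary size.
    return [pair for pair, _ in counts.most_common(size)]

vocab = build_bigram_vocab(b"abababcd" * 100)
```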
Activation
Leaky ReLU squared activation with slope 0.5
parameters: {"slope":0.5}
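A minimal sketch of one plausible reading of "leaky ReLU squared" with slope 0.5: apply the leaky ReLU, then square its output while preserving sign so the function stays monotonic. The exact definition in the PR's code may differ (e.g. squaring without the sign term):

```python
# Assumed definition of leaky_relu_sq; not confirmed by the PR.

SLOPE = 0.5  # from parameters: {"slope": 0.5}

def leaky_relu_sq(x: float, slope: float = SLOPE) -> float:
    y = x if x > 0 else slope * x   # leaky ReLU with negative slope 0.5
    return y * y if y >= 0 else -(y * y)  # square, keeping the sign
```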
Test-Time Training
score-first TTT
parameters: {"freeze_blocks":0,"grad_clip":0.8}
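The recorded parameters suggest every block stays trainable (freeze_blocks=0) and updates are gradient-clipped at 0.8. A loose sketch of a "score-first" step, where the sequence is scored with the unadapted weights before the update; the learning rate and the one-parameter quadratic loss standing in for the model are both illustrative assumptions:

```python
# Loose sketch of a score-first TTT step on a toy one-parameter loss.

GRAD_CLIP = 0.8   # from parameters: {"grad_clip": 0.8}
LR = 0.1          # assumed; not recorded in the PR

def ttt_step(w: float, target: float):
    score = (w - target) ** 2   # score first, with the unadapted weights
    grad = 2 * (w - target)     # then compute the adaptation update
    grad = max(-GRAD_CLIP, min(GRAD_CLIP, grad))  # clip the gradient
    return score, w - LR * grad

score, w = ttt_step(1.0, 0.0)   # scores 1.0, then moves w toward 0
```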
Quantization
GPTQ
bits: 6
scope: null
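For scale, a rough illustration of 6-bit symmetric quantization. This is plain round-to-nearest fake-quant; GPTQ proper additionally compensates rounding error column-by-column using second-order (Hessian) information, which is omitted here:

```python
# Simplified stand-in for the 6-bit setting recorded above (not GPTQ itself).

BITS = 6

def quantize_rtn(weights, bits: int = BITS):
    qmax = 2 ** (bits - 1) - 1              # 31 levels each side for 6-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]  # round-to-nearest integers
    return [v * scale for v in q], scale     # dequantized weights, scale

deq, scale = quantize_rtn([0.9, -0.31, 0.02, -0.62])
```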
Weight Averaging
SWA
parameters: null
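The idea behind SWA is to keep a running average of weight snapshots taken along the training trajectory and evaluate with the averaged weights. The PR records no parameters, so the snapshot schedule here is illustrative:

```python
# Sketch of SWA's core operation: an incremental elementwise mean over
# weight snapshots.

def swa_average(snapshots):
    """Running mean over a list of weight vectors."""
    avg = list(snapshots[0])
    for n, snap in enumerate(snapshots[1:], start=2):
        for i, w in enumerate(snap):
            avg[i] += (w - avg[i]) / n    # incremental mean update
    return avg

avg = swa_average([[1.0, 0.0], [3.0, 2.0]])  # elementwise mean of snapshots
```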
Compression
zstd
level: null

Novel Contributions

  • Leaky-ReLU-squared (leaky_relu_sq) activation with slope 0.5
  • Cross Self-Attention (XSA) applied to the last 4 layers
  • Bigram vocabulary size increased to 1536
  • Legal (rule-compliant) score-first Test-Time Training (TTT) with freeze_blocks=0 and grad_clip=0.8
  • GPTQ int6 quantization combined with zstd compression
  • Stochastic Weight Averaging (SWA)
  • Late Quantization-Aware Training (QAT)