val_bpb: 1.1190
Architecture: U-Net
Optimizer: —
Artifact Size: 15.81 MB
Training Techniques
Architecture
XSA
Cross Self-Attention applied to the last 4 layers
parameters: {"layers":4}
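The card does not define XSA beyond "Cross Self-Attention in the last 4 layers", so only the wiring can be sketched. In the sketch below, `SelfAttention` and `CrossSelfAttention` are hypothetical placeholder classes and the 12-layer depth is an assumption; the point is solely how the last `layers=4` blocks would swap in the XSA module.

```python
# Structural sketch only: which blocks receive the XSA module.
# `SelfAttention` and `CrossSelfAttention` are hypothetical placeholders;
# the actual attention math is not specified in the card.

class SelfAttention:
    """Stand-in for a standard self-attention block."""

class CrossSelfAttention:
    """Stand-in for the XSA (Cross Self-Attention) block."""

def build_blocks(n_layers: int = 12, xsa_layers: int = 4):
    """Use XSA in the last `xsa_layers` blocks, plain attention elsewhere."""
    blocks = []
    for i in range(n_layers):
        if i >= n_layers - xsa_layers:
            blocks.append(CrossSelfAttention())
        else:
            blocks.append(SelfAttention())
    return blocks

blocks = build_blocks()  # blocks[8:] are the 4 XSA blocks
```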
Bigram Vocab
Bigram vocabulary size set to 1536
parameters: {"vocab_size":1536}
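How the 1536-entry bigram vocabulary is built is not stated; a common construction (assumed here) assigns ids to the most frequent adjacent token pairs, layered on top of a base vocabulary. A minimal frequency-count sketch:

```python
from collections import Counter

def build_bigram_vocab(token_ids, vocab_size=1536):
    """Assign ids to the `vocab_size` most frequent adjacent token pairs.
    Simplified sketch: a real pipeline would layer this on a base
    vocabulary and handle ties and frequency thresholds explicitly."""
    pairs = Counter(zip(token_ids, token_ids[1:]))
    most_common = [pair for pair, _count in pairs.most_common(vocab_size)]
    return {pair: idx for idx, pair in enumerate(most_common)}

# Toy usage with a tiny vocab_size for illustration.
ids = [1, 2, 1, 2, 3, 1, 2]
vocab = build_bigram_vocab(ids, vocab_size=4)  # (1, 2) is most frequent
```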
Activation
Leaky ReLU squared activation with slope 0.5
parameters: {"slope":0.5}
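The exact form of "Leaky ReLU squared" is not given. One plausible reading (assumed here) squares the leaky-ReLU output while preserving its sign, so negative inputs still leak through with slope 0.5 rather than being folded into the positive range by the square:

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """Sign-preserving square of leaky ReLU (assumed interpretation):
    y = x for x > 0, y = slope * x otherwise; return y * |y|."""
    y = x if x > 0 else slope * x
    return y * abs(y)

print(leaky_relu_sq(2.0))   # 4.0
print(leaky_relu_sq(-2.0))  # -1.0
```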
Test-Time Training
score-first TTT
parameters: {"freeze_blocks":0,"grad_clip":0.8}
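"Score-first" TTT is not defined in the card; the sketch below shows only the generic mechanics the parameters imply: with freeze_blocks=0 every parameter stays trainable, and each test-time gradient is clipped to norm 0.8 before the update. The one-parameter model, quadratic loss, and learning rate are illustrative assumptions:

```python
def ttt_step(w, grad, grad_clip=0.8, lr=0.1):
    """One test-time training update with gradient-norm clipping."""
    norm = abs(grad)  # L2 norm; scalar case for illustration
    if norm > grad_clip:
        grad = grad * (grad_clip / norm)
    return w - lr * grad

# Toy adaptation loop: fit w toward 3.0 on "test" data.
# freeze_blocks=0 means nothing is frozen, so w always updates.
w = 0.0
for _ in range(50):
    grad = 2 * (w - 3.0)  # d/dw of the toy loss (w - 3)^2
    w = ttt_step(w, grad)
```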
Quantization
GPTQ
parameters: {"bits":6,"scope":null}
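Full GPTQ corrects quantization error with second-order (Hessian-based) information; as a much simpler stand-in, the sketch below shows plain symmetric round-to-nearest 6-bit quantization, which at least pins down the int6 range the card specifies:

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest 6-bit quantization (simplified stand-in;
    NOT full GPTQ, which also applies Hessian-based error correction).
    Symmetric signed 6-bit range used here: [-31, 31]."""
    scale = max(abs(w) for w in weights) / 31 or 1.0  # avoid scale == 0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 1.19]
q, s = quantize_int6(w)
w_hat = dequantize(q, s)  # reconstruction error bounded by scale / 2
```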
Weight Averaging
SWA
parameters: {}
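SWA maintains a running average of weights sampled along the training trajectory. The card gives no sampling schedule, so the sketch below shows only the core running-mean update; the checkpoint cadence and learning-rate schedule are left out as unspecified:

```python
def swa_update(avg, weights, n):
    """Fold the next sampled checkpoint into the running average.
    `n` is the number of checkpoints already averaged into `avg`."""
    return [(a * n + w) / (n + 1) for a, w in zip(avg, weights)]

# Average three checkpoints of a toy 2-weight model.
checkpoints = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)
# avg is now the element-wise mean of all three checkpoints
```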
Compression
zstd
parameters: {"level":null}
Novel Contributions
- Leaky ReLU squared (leaky_relu_sq) activation with slope 0.5
- Cross Self-Attention (XSA) applied in the last 4 layers
- Bigram vocabulary size increased to 1536
- Legal score-first Test-Time Training (TTT) with freeze_blocks=0 and grad_clip=0.8
- GPTQ int6 quantization combined with zstd compression
- Stochastic Weight Averaging (SWA)
- Late Quantization-Aware Training (QAT)
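Late QAT (enabling quantization-aware training near the end of the run) is listed without details. The sketch below shows the usual fake-quantization forward pass with straight-through gradients applied to full-precision master weights; the 6-bit quantizer is an illustrative choice to match the card's int6 GPTQ setting:

```python
def fake_quant(w, bits=6):
    """Quantize-dequantize in the forward pass (fake quantization)."""
    levels = 2 ** (bits - 1) - 1  # 31 for 6-bit symmetric
    scale = max(abs(v) for v in w) / levels or 1.0  # avoid scale == 0
    return [round(v / scale) * scale for v in w]

def qat_step(w, grads, lr=0.01):
    """Straight-through estimator: gradients computed through the
    fake-quantized forward pass update the full-precision weights."""
    return [v - lr * g for v, g in zip(w, grads)]

w = [0.5, -1.0]
w_q = fake_quant(w)            # quantized weights used in the forward pass
w = qat_step(w, [0.1, -0.2])   # full-precision master weights updated
```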