val_bpb: 1.1220
Architecture: Transformer
Optimizer: AdamW
Artifact Size: ~15.9 MB
Training Techniques
Architecture
- XSA: applied on all layers (layers: 11)
- BigramHash: bigram hash embedding component (vocab_size: 3072, dim: 112)
- LeakyReLU: LeakyReLU squared activation (slope: 0.5)
- GQA: grouped query attention with 8 attention heads and 4 KV heads (layers: 11, heads: 8, kv_heads: 4)
- RoPE: partial rotary positional embedding (rotary dimensions: 16 of 64)
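As an illustration of the bigram hash embedding (a sketch, not the submission's code): each (previous token, current token) pair is hashed into a fixed table of 3072 rows of width 112, and the looked-up vector augments the usual token embedding. The hashing constants and function names here are illustrative assumptions.

```python
# Hypothetical sketch of a bigram hash embedding (not the submission's code).
HASH_VOCAB = 3072   # number of hash buckets (from the listed parameters)
HASH_DIM = 112      # embedding width of each bucket

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """Mix the two token ids and reduce modulo the table size."""
    # Multiplicative hashing constants are arbitrary illustrative choices.
    h = (prev_tok * 1000003 + cur_tok) * 2654435761
    return (h & 0xFFFFFFFF) % HASH_VOCAB

def bigram_embed(tokens: list[int], table: list[list[float]]) -> list[list[float]]:
    """Return one HASH_DIM vector per position (zeros for position 0)."""
    out = [[0.0] * HASH_DIM]  # no previous token at position 0
    for i in range(1, len(tokens)):
        out.append(table[bigram_bucket(tokens[i - 1], tokens[i])])
    return out
```

In practice the looked-up vector would be added to (or concatenated with) the standard unigram embedding before the first transformer block.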
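The LeakyReLU-squared activation, read literally, squares the output of a LeakyReLU with negative slope 0.5. Note that plain squaring folds negative inputs to positive outputs; a sign-preserving variant is also plausible, and the exact convention is an assumption here.

```python
# Hedged sketch: "LeakyReLU squared" read as squaring the LeakyReLU output.
def leaky_relu(x: float, slope: float = 0.5) -> float:
    return x if x > 0 else slope * x

def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # Plain square; some implementations instead keep the sign (y * |y|).
    y = leaky_relu(x, slope)
    return y * y
```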
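The GQA configuration can be summarized by its head-sharing map alone (no attention math shown): with 8 query heads and 4 KV heads, each KV head serves 2 consecutive query heads, halving the KV cache. The function name below is illustrative.

```python
# Sketch of the grouped-query-attention head mapping only.
HEADS = 8
KV_HEADS = 4

def kv_head_for(query_head: int) -> int:
    """Index of the shared K/V head used by a given query head."""
    group = HEADS // KV_HEADS  # query heads per KV head (= 2 here)
    return query_head // group
```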
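Partial RoPE rotates only the first 16 of the 64 head dimensions; the rest pass through unrotated. A minimal sketch, assuming the standard RoPE base frequency of 10000 (not stated in the card):

```python
import math

# Sketch of partial RoPE: only the first ROT dims of each 64-dim head are
# rotated by position-dependent angles; the remaining dims are unchanged.
ROT, HEAD_DIM = 16, 64

def partial_rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    out = list(vec)
    for i in range(0, ROT, 2):  # rotate dims in pairs (i, i+1)
        theta = pos * base ** (-i / ROT)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out
```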
Quantization
- GPTQ (bits: 6, scope: all)
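For intuition, the 6-bit grid alone can be sketched as symmetric round-to-nearest quantization. GPTQ proper is different: it chooses the rounding using second-order (Hessian) information and compensates each column's error into not-yet-quantized columns; the sketch below shows only the grid, not the GPTQ update.

```python
# Illustrative sketch of the 6-bit weight grid only (not GPTQ's algorithm).
def quantize_6bit(weights: list[float]) -> tuple[list[int], float]:
    """Map floats onto a symmetric 6-bit integer grid [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1  # 31
    # Per-tensor scale; the `or 1.0` guards against an all-zero input.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes: list[int], scale: float) -> list[float]:
    return [c * scale for c in codes]
```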
Weight Averaging
- EMA (decay: 0.997)
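EMA weight averaging with decay 0.997 keeps a shadow copy of the parameters that moves a fraction (1 - decay) toward the live weights after each optimizer step; the shadow copy is what gets evaluated and shipped. A minimal sketch over scalar parameters:

```python
# Minimal sketch of EMA weight averaging with the listed decay.
DECAY = 0.997

def ema_update(shadow: dict[str, float], params: dict[str, float]) -> None:
    """Move the shadow copy a fraction (1 - DECAY) toward the live weights."""
    for name, value in params.items():
        shadow[name] = DECAY * shadow[name] + (1.0 - DECAY) * value
```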
Compression
- lzma (level: null)
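With the level left null, a natural reading is that the library's default preset applies. A round-trip sketch using Python's stdlib `lzma` module (the card does not state which lzma binding was used):

```python
import lzma

# Sketch of artifact packing: lzma compression with the default preset
# when preset is None (Python stdlib semantics).
def pack(blob: bytes, preset=None) -> bytes:
    return lzma.compress(blob, preset=preset)

def unpack(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```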
Evaluation
- Sliding-window evaluation (stride: 64)
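Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time and scores only the newly exposed tokens, so every token is evaluated exactly once with the longest available left context. A sketch of the window schedule (the window length of 256 is an illustrative assumption, not from the card):

```python
# Sketch of the sliding-window evaluation schedule with stride 64.
def eval_windows(n_tokens: int, window: int = 256, stride: int = 64):
    """Yield (start, end, first_scored) triples covering every token once."""
    pos = 0  # first token not yet scored
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)  # longest available left context
        yield start, end, pos         # score tokens in [pos, end)
        pos = end
```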
Other
- FA3 dtype compatibility wrapper: casts inputs to bf16 when PyTorch does not auto-cast for Flash Attention 3 calls
Novel Contributions
- FA3 dtype compatibility wrapper for PyTorch 2.5.1 Hopper attention
- XSA on all layers
- Full-Hessian GPTQ with autoregressive self-generated calibration
- BigramHash 3072×112
- EMA with decay 0.997
- LeakyReLU squared activation