val_bpb: 0.8705
Architecture: Transformer
Optimizer: —
Artifact Size: 15825448 bytes
Training Techniques

Architecture
- MLP3x: Transformer with 3x MLP expansion (parameters: {"expansion":3})
- GQA: uses 8 attention heads and 4 KV heads (parameters: {"attention_heads":8,"kv_heads":4})
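With grouped-query attention at 8 query heads and 4 KV heads, each pair of consecutive query heads shares one key/value head. A minimal sketch of that head mapping (the function name is illustrative, not from the record):

```python
def kv_head_index(q_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    # Consecutive query heads are grouped; each group of
    # n_heads // n_kv_heads query heads reads the same KV head.
    group_size = n_heads // n_kv_heads
    return q_head // group_size
```

With the listed parameters, query heads 0-1 map to KV head 0, heads 2-3 to KV head 1, and so on.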
- LeakyReLU: MLP activation changed to LeakyReLU(0.5)^2 (parameters: {"slope":0.5,"squared":true})
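One plausible reading of LeakyReLU(0.5)^2, matching the parameters {"slope":0.5,"squared":true}: apply a LeakyReLU with negative slope 0.5, then square the result, so negative pre-activations produce small positive outputs instead of the zeros a ReLU^2 would give. A scalar sketch under that reading:

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # LeakyReLU with the listed slope, then squared.
    y = x if x >= 0 else slope * x
    return y * y
```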
- XSA: applied on late layers (parameters: {"layers":"late"})
- Partial RoPE: uses partial rotary positional embeddings (parameters: null)
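Partial RoPE applies the rotary embedding to only a fraction of each head's dimensions and passes the rest through unchanged. The record lists no parameters, so the rotated fraction (0.5) and frequency base (10000) below are assumptions:

```python
import math

def partial_rope(vec: list, pos: int, rot_frac: float = 0.5,
                 base: float = 10000.0) -> list:
    # Rotate only the first rot_frac of the dimensions, in pairs;
    # the remaining dimensions are left untouched.
    d = len(vec)
    rot = int(d * rot_frac)
    out = list(vec)
    for i in range(0, rot, 2):
        theta = pos / (base ** (i / rot))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```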
Regularization
- LN scale (parameters: null)

Weight Averaging
- EMA (parameters: null)
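EMA weight averaging keeps a shadow copy of the parameters, updated each step as a decayed running average. The record lists no EMA parameters, so the decay value below is an assumption:

```python
def ema_update(shadow: list, params: list, decay: float = 0.999) -> list:
    # shadow <- decay * shadow + (1 - decay) * params, elementwise.
    # decay=0.999 is an assumed value; the record gives no parameters.
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```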
Quantization
- late QAT (bits: null, scope: model)
- GPTQ-lite (bits: 6, scope: all)
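A minimal sketch of a symmetric int6 quantize/dequantize roundtrip check, in the spirit of the "GPTQ-lite int6 export with roundtrip verification" listed under Novel Contributions; this is a plain round-to-nearest baseline for illustration, not the GPTQ-lite procedure itself:

```python
def quantize_int6(weights: list):
    # Symmetric per-tensor int6: signed range [-32, 31].
    scale = max(abs(w) for w in weights) / 31.0 or 1.0  # guard all-zero tensors
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list, scale: float) -> list:
    return [x * scale for x in q]
```

Roundtrip verification then amounts to checking that every dequantized weight is within half a quantization step of the original.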
LR Schedule
- warmdown (parameters: {"warmdown_steps":3500})
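A warmdown schedule holds the base learning rate and then decays it over the final warmdown_steps of training. Only warmdown_steps=3500 appears in the record, so the linear decay shape below is an assumption:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    # Constant LR until the warmdown window, then linear decay to zero.
    # The linear shape is assumed; only warmdown_steps is in the record.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps
    return base_lr * frac
```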
Compression
- zstd (level: null)
Novel Contributions
- LeakyReLU(0.5)^2 MLP activation in place of relu^2
- EMA-based 11-layer Transformer record attempt
- GPTQ-lite int6 export with roundtrip verification
- Late QAT at threshold 0.15
- Portability fixes for non-FA3 environments