val_bpb: 1.2092
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~14.39 MB
Training Techniques
Architecture
LeakyReLU
MLP activation changed from ReLU² to LeakyReLU(0.75)².
parameters: {"negative_slope":0.75}
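A minimal sketch of this activation in plain Python (the function name is illustrative; the model applies it elementwise inside the MLP):

```python
def leaky_relu_squared(x, negative_slope=0.75):
    """LeakyReLU(negative_slope) followed by squaring, replacing ReLU(x)**2.

    For x >= 0 this matches ReLU squared; for x < 0 it returns
    (negative_slope * x) ** 2, so the negative branch contributes a
    nonzero (and non-monotonic) response instead of a flat zero.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```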
Partial RoPE
Rotary embedding applied to only part of each head dimension.
parameters: {"dimensions":16,"head_dimensions":64}
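A sketch of partial RoPE on one head vector, in plain Python. The choice to rotate the *first* 16 of the 64 dimensions (rather than some other split), and the frequency base of 10000, are assumptions for illustration:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first rot_dims components
    of a single head vector x (e.g. length 64) at token position pos.

    Consecutive pairs (x[i], x[i+1]) are rotated by a position-dependent
    angle; components beyond rot_dims pass through unchanged.
    """
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```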
XSA
XSA enabled only in the deepest layers.
parameters: {"layers":4}
FlashAttention-3
Standard SDPA replaced with FlashAttention-3 for attention computation.
parameters: null
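FlashAttention-3 is an exact, IO-aware implementation of standard scaled dot-product attention, so the swap changes speed and memory, not the result. Below is a tiny pure-Python reference of the computation being replaced (not the tiled kernel itself); the helper name is illustrative:

```python
import math

def sdpa(q, k, v):
    """Reference scaled dot-product attention for one head, no masking.

    q, k, v: lists of vectors (lists of floats) with a common depth d.
    Returns softmax(q k^T / sqrt(d)) v, computed row by row with the
    usual max-subtraction for numerical stability. FlashAttention-3
    produces the same values without materializing the score matrix.
    """
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * vj[t] for w, vj in zip(weights, v))
                    for t in range(len(v[0]))])
    return out
```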
Quantization: mixed int6
bits: 6
scope: model weights
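A minimal symmetric 6-bit quantize/dequantize sketch in plain Python. The per-tensor scaling scheme here is an assumption; GPTQ-style quantization additionally performs error-compensated rounding, which this sketch omits:

```python
def quantize_int6(weights):
    """Symmetric per-tensor 6-bit quantization: ints in [-31, 31] plus a scale."""
    qmax = 2 ** (6 - 1) - 1  # 31 for signed int6
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero tensors
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int6 codes and the scale."""
    return [qi * scale for qi in q]
```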
Compression: lzma
level: null
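The compression level is unspecified above; Python's stdlib `lzma` with its default preset gives a sense of the packaging step (a sketch of the idea, not the actual export code):

```python
import lzma

# Stand-in for a packed int6 weight buffer (the real artifact, not this data).
payload = bytes(range(64)) * 100

blob = lzma.compress(payload, preset=lzma.PRESET_DEFAULT)
restored = lzma.decompress(blob)
assert restored == payload  # LZMA is lossless: exact round-trip
```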
Optimizer: Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr": null}
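All Muon hyperparameters are null in this card, so only its core operation can be sketched: orthogonalizing the momentum matrix via a Newton-Schulz iteration before the update. The sketch below uses the classical cubic iteration for clarity; actual Muon uses a tuned quintic polynomial run in low precision, and the step count here is an assumption:

```python
def matmul(A, B):
    """Plain-Python matrix multiply over lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz_orth(G, steps=10):
    """Approximately orthogonalize G via X <- 1.5 X - 0.5 (X X^T) X.

    G is first normalized by its Frobenius norm so all singular values
    lie in (0, 1], inside the iteration's basin of convergence; the
    singular values are then driven toward 1, yielding a near-orthogonal
    matrix. Muon applies this kind of iteration to its momentum buffer.
    """
    frob = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / frob for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X
```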
Novel Contributions
- LeakyReLU(0.75)² replaces ReLU² in the MLP.
- Partial RoPE is used with 16 of 64 head dimensions.
- XSA is applied only to the last 4 layers.
- FlashAttention-3 is used instead of standard SDPA.
- GPTQ-style mixed int6 export with LZMA compression and selective pruning.