| Metric | Value |
| --- | --- |
| val_bpb | 1.1228 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15,561,740 bytes (~14.84 MiB) |
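As a quick sanity check (assuming the 16MB cap is the binary reading, 16 × 1024 × 1024 bytes; the artifact also fits under a decimal 16,000,000-byte reading):

```python
ARTIFACT_BYTES = 15_561_740
CAP_BYTES = 16 * 1024 * 1024        # assumed binary cap: 16,777,216 bytes
assert ARTIFACT_BYTES < CAP_BYTES
print(f"{ARTIFACT_BYTES / 2**20:.2f} MiB used of {CAP_BYTES / 2**20:.0f} MiB")  # ~14.84 MiB
```

This leaves roughly 1.16 MiB of headroom under the binary cap.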
Training Techniques

- Compression: zstd, level: null (auto-selected)
- Quantization: mixed int6/int8
  - int6, bits: 6, scope: MLP and attention (embeddings stay on the int8 path)
  - int8, bits: 8, scope: embeddings
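The mixed int6/int8 scheme could be sketched as symmetric per-tensor quantization; the `quantize_symmetric` helper and the per-tensor scaling choice are assumptions for illustration, not the submission's actual code. With packing disabled, int6 values stay in an int8 container rather than being bit-packed:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to a signed `bits`-bit range.
    (Assumed scheme: per-tensor scale, unpacked int8 storage.)"""
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax or 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w_mlp = rng.standard_normal((4, 4)).astype(np.float32)
q6, s6 = quantize_symmetric(w_mlp, bits=6)      # MLP/attention path
q8, s8 = quantize_symmetric(w_mlp, bits=8)      # embedding path
```

Keeping embeddings on the int8 path spends a little extra size where quantization error tends to hurt quality most, while int6 on MLP and attention recovers most of the byte budget.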
Architecture

- XSA: applied to the last 4 layers. parameters: {"layers": [7, 8, 9, 10]}
- VE: enabled in the top layers. parameters: {"layers": [9, 10]}
- BigramHash: bigram vocabulary/hash component. parameters: {"vocab_size": 2048}
- MLP3x: increased MLP capacity with a 3.0x multiplier. parameters: {"multiplier": 3}
- GQA: grouped-query attention via fewer KV heads than attention heads. parameters: {"num_heads": 8, "num_kv_heads": 4}
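Of these components, GQA has a standard form: each group of query heads attends using a shared KV head. A minimal NumPy sketch with the listed head counts (8 query heads, 4 KV heads); causal masking is omitted for brevity, and this is an illustration rather than the submission's code:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention.
    q: (T, num_heads, d); k, v: (T, num_kv_heads, d).
    Each group of num_heads // num_kv_heads query heads shares one KV head."""
    group = q.shape[1] // k.shape[1]        # 8 // 4 = 2 query heads per KV head
    k = np.repeat(k, group, axis=1)         # expand KV heads to match query heads
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum("hts,shd->thd", probs, v)

rng = np.random.default_rng(0)
T, d = 5, 16
q = rng.standard_normal((T, 8, d))          # 8 query heads, as in the config
k = rng.standard_normal((T, 4, d))          # 4 KV heads
v = rng.standard_normal((T, 4, d))
out = gqa_attention(q, k, v)                # shape (T, 8, d)
```

Halving the KV heads halves the KV projection weights, which matters under a tight artifact-size cap.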
Sequence Length

- train_length: 2048
- eval_length: null
Evaluation

- sliding window eval, parameters: {"stride": 64}
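Sliding-window evaluation re-scores the token stream in overlapping windows so that each token is predicted with long context, while only the tokens not yet counted by an earlier window contribute to the loss. A sketch of the bookkeeping; the `score_fn` model interface is a hypothetical stand-in:

```python
import math

def sliding_window_nll(score_fn, tokens, window=2048, stride=64):
    """score_fn(window_tokens, n_new) -> summed NLL in nats over the last
    n_new tokens of the window (hypothetical model interface)."""
    total_nats, prev_end = 0.0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        n_new = end - prev_end      # tokens not scored by an earlier window
        total_nats += score_fn(tokens[begin:end], n_new)
        prev_end = end
        if end == len(tokens):
            break
    return total_nats

def bpb(total_nats, n_bytes):
    """Bits-per-byte from summed nats and the byte length of the eval text."""
    return total_nats / (n_bytes * math.log(2))
```

A small stride like 64 means nearly every token is scored with close to the full 2048-token context, at the cost of many more forward passes than a non-overlapping evaluation.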
Novel Contributions
- Quality-first mixed-quant submission targeting the 16MB cap
- Mixed quantization with int6 for MLP and attention while keeping embeddings on int8
- Disabled int6 packing and used auto-selected zstd compression
- Used sliding-window evaluation with stride 64 to improve reported val_bpb
- Applied XSA to the last 4 layers and VE to layers 9-10
- Increased MLP capacity with a 3.0x multiplier and added a bigram hash component with vocabulary size 2048