PR #1080

open

Add non-record 16MB submission: quant-quality-first 1.12276 BPB

val_bpb: 1.1228
Architecture: Transformer
Optimizer:
Artifact Size: 15561740 bytes (~14.8 MiB)

Training Techniques

Compression
  • zstd (level: null, auto-selected)

Quantization
  • mixed int6/int8 (bits: 6; scope: MLP and attention; embeddings stay on the int8 path)
  • int8 (bits: 8; scope: embeddings)
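A minimal sketch of the mixed-precision scheme described above: symmetric per-tensor quantization with 6-bit integers for MLP/attention weights and 8-bit integers for embeddings. The function names and the per-tensor max-abs scaling are assumptions for illustration, not the submission's actual code.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for int6, 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0  # per-tensor scale (assumed)
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized integers."""
    return [v * scale for v in q]

weights = [0.31, -0.07, 0.55, -0.42]
q6, s6 = quantize_symmetric(weights, bits=6)   # int6 path: MLP and attention
q8, s8 = quantize_symmetric(weights, bits=8)   # int8 path: embeddings
```

With 6 bits the worst-case rounding error per weight is half the (larger) scale step, which is the quality/size trade the submission makes for the non-embedding tensors.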
Architecture
  • XSA: applied to the last 4 layers. parameters: {"layers":[7,8,9,10]}
  • VE: enabled in the top layers. parameters: {"layers":[9,10]}
  • BigramHash: bigram vocabulary/hash component. parameters: {"vocab_size":2048}
  • MLP3x: MLP capacity increased with a 3.0x multiplier. parameters: {"multiplier":3}
  • GQA: grouped-query attention via fewer KV heads than attention heads. parameters: {"num_heads":8,"num_kv_heads":4}
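The GQA parameters above imply a fixed query-to-KV head mapping, with 8 query heads sharing 4 KV heads. A hypothetical sketch of that mapping (names are illustrative, not from the submission):

```python
NUM_HEADS = 8        # query heads, from the parameters above
NUM_KV_HEADS = 4     # shared key/value heads
GROUP_SIZE = NUM_HEADS // NUM_KV_HEADS  # 2 query heads per KV head

def kv_head_for(query_head):
    """Each group of GROUP_SIZE consecutive query heads reads the same K/V projection."""
    return query_head // GROUP_SIZE

mapping = {h: kv_head_for(h) for h in range(NUM_HEADS)}
# KV projections (and KV cache) shrink by NUM_HEADS / NUM_KV_HEADS = 2x vs. full MHA.
```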
Sequence Length
  • sequence_length: train_length: 2048; eval_length: null

Evaluation
  • sliding window eval. parameters: {"stride":64}
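One common way to realize sliding-window evaluation with a small stride is to advance the window 64 tokens at a time and score only the tokens not yet covered, so each scored token sees close to a full window of context. This sketch assumes that standard strided scheme and a 2048-token window (matching train_length); the submission's exact implementation may differ.

```python
def eval_windows(num_tokens, window=2048, stride=64):
    """Return (begin, end, score_from) spans; loss is taken on [score_from, end)."""
    spans, prev_end = [], 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + window, num_tokens)
        spans.append((begin, end, prev_end))  # context is [begin, end)
        prev_end = end                        # never re-score covered tokens
        if end == num_tokens:
            break
    return spans

spans = eval_windows(4096)
```

Each token is scored exactly once, but after the first window every scored token has at least window - stride = 1984 tokens of context, which is why this reporting style lowers val_bpb relative to disjoint chunks.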

Novel Contributions

  • Quality-first mixed-quant submission targeting the 16MB cap
  • Mixed quantization with int6 for MLP and attention while keeping embeddings on int8
  • Disabled int6 packing and used auto-selected zstd compression
  • Used sliding-window evaluation with stride 64, giving each scored token near-full context and improving reported val_bpb
  • Applied XSA to the last 4 layers and VE to layers 9-10
  • Increased MLP capacity with MLP multiplier 3.0 and bigram vocabulary size 2048
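For context on the disabled int6 packing: packing would normally store four 6-bit values in three bytes instead of four, so disabling it trades roughly 25% of the int6 tensors' size for byte-aligned, simpler decoding. This sketch of such a packer is purely illustrative and is not the submission's code (unsigned 6-bit values assumed).

```python
def pack_int6(vals):
    """Pack unsigned 6-bit values (0..63), four at a time, into 3-byte groups."""
    assert len(vals) % 4 == 0
    out = bytearray()
    for a, b, c, d in zip(*[iter(vals)] * 4):
        bits = (a << 18) | (b << 12) | (c << 6) | d   # 24 bits total
        out += bits.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6: expand each 3-byte group back into four values."""
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        vals += [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63]
    return vals
```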