PR #1080

open

Add non-record 16MB submission: quant-quality-first 1.12276 BPB

val_bpb: 1.1228
Architecture: Transformer
Optimizer:
Artifact Size: 15561740 bytes (~14.8 MiB)

Training Techniques

Compression
  • zstd (level: null, auto-selected)

Quantization
  • mixed int6/int8 (bits: 6; scope: MLP and attention; embeddings stay on the int8 path)
  • int8 (bits: 8; scope: embeddings)
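A minimal sketch of the mixed-precision scheme described above: symmetric per-tensor quantization with 6-bit integers for MLP/attention weights and 8-bit integers for embeddings. The function names and the per-tensor max-abs scaling are assumptions for illustration, not the submission's actual code.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                     # 31 for int6, 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0  # per-tensor scale (assumed)
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized integers."""
    return [v * scale for v in q]

weights = [0.31, -0.07, 0.55, -0.42]
q6, s6 = quantize_symmetric(weights, bits=6)   # int6 path: MLP and attention
q8, s8 = quantize_symmetric(weights, bits=8)   # int8 path: embeddings
```

With 6 bits the worst-case rounding error per weight is half the (larger) scale step, which is the quality/size trade the submission makes for the non-embedding tensors.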
Architecture
  • XSA: applied to the last 4 layers. parameters: {"layers":[7,8,9,10]}
  • VE: enabled in the top layers. parameters: {"layers":[9,10]}
  • BigramHash: bigram vocabulary/hash component. parameters: {"vocab_size":2048}
  • MLP3x: MLP capacity increased with a 3.0x multiplier. parameters: {"multiplier":3}
  • GQA: grouped-query attention via fewer KV heads than attention heads. parameters: {"num_heads":8,"num_kv_heads":4}
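The GQA parameters above imply a fixed query-to-KV head mapping, with 8 query heads sharing 4 KV heads. A hypothetical sketch of that mapping (names are illustrative, not from the submission):

```python
NUM_HEADS = 8        # query heads, from the parameters above
NUM_KV_HEADS = 4     # shared key/value heads
GROUP_SIZE = NUM_HEADS // NUM_KV_HEADS  # 2 query heads per KV head

def kv_head_for(query_head):
    """Each group of GROUP_SIZE consecutive query heads reads the same K/V projection."""
    return query_head // GROUP_SIZE

mapping = {h: kv_head_for(h) for h in range(NUM_HEADS)}
# KV projections (and KV cache) shrink by NUM_HEADS / NUM_KV_HEADS = 2x vs. full MHA.
```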
Sequence Length
  • sequence_length: train_length: 2048; eval_length: null

Evaluation
  • sliding window eval. parameters: {"stride":64}
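One common way to realize sliding-window evaluation with a small stride is to advance the window 64 tokens at a time and score only the tokens not yet covered, so each scored token sees close to a full window of context. This sketch assumes that standard strided scheme and a 2048-token window (matching train_length); the submission's exact implementation may differ.

```python
def eval_windows(num_tokens, window=2048, stride=64):
    """Return (begin, end, score_from) spans; loss is taken on [score_from, end)."""
    spans, prev_end = [], 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + window, num_tokens)
        spans.append((begin, end, prev_end))  # context is [begin, end)
        prev_end = end                        # never re-score covered tokens
        if end == num_tokens:
            break
    return spans

spans = eval_windows(4096)
```

Each token is scored exactly once, but after the first window every scored token has at least window - stride = 1984 tokens of context, which is why this reporting style lowers val_bpb relative to disjoint chunks.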

Novel Contributions

  • Quality-first mixed-quant submission targeting the 16MB cap
  • Mixed quantization with int6 for MLP and attention while keeping embeddings on int8
  • Disabled int6 packing and used auto-selected zstd compression
  • Used sliding-window evaluation with stride 64, giving each scored token near-full context and improving reported val_bpb
  • Applied XSA to the last 4 layers and VE to layers 9-10
  • Increased MLP capacity with MLP multiplier 3.0 and bigram vocabulary size 2048
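For context on the disabled int6 packing: packing would normally store four 6-bit values in three bytes instead of four, so disabling it trades roughly 25% of the int6 tensors' size for byte-aligned, simpler decoding. This sketch of such a packer is purely illustrative and is not the submission's code (unsigned 6-bit values assumed).

```python
def pack_int6(vals):
    """Pack unsigned 6-bit values (0..63), four at a time, into 3-byte groups."""
    assert len(vals) % 4 == 0
    out = bytearray()
    for a, b, c, d in zip(*[iter(vals)] * 4):
        bits = (a << 18) | (b << 12) | (c << 6) | d   # 24 bits total
        out += bits.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6: expand each 3-byte group back into four values."""
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "big")
        vals += [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63]
    return vals
```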