PR #2103

open

Add 16MB SP1024 Value Residual + PPM mixture submission ppm_mix_bpb 0.829467

val_bpb

0.8295

Architecture

Transformer

Optimizer

—

Artifact Size

15,806,135 bytes

Training Techniques

Architecture

Value Residual

Enabled value residual connections in the last 2 layers of the Transformer.

parameters: {"last_n_layers":2}
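The entry above gives only the layer scope (last_n_layers = 2). As a rough sketch of what a value residual could look like, the code below blends a layer's attention value vectors with those of the first layer. The blend rule, the weight `lam`, and the helper names are assumptions for illustration, not taken from the PR.

```python
def uses_value_residual(layer_idx, n_layers, last_n_layers=2):
    """Return True for the final `last_n_layers` layers (0-indexed)."""
    return layer_idx >= n_layers - last_n_layers


def mix_values(v_layer, v_first, lam=0.5):
    """Blend this layer's value vectors with the first layer's.

    `lam` is a hypothetical mixing weight; the PR does not state one.
    Both inputs are lists of equal-length float vectors (one per position).
    """
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row, row_first)]
        for row, row_first in zip(v_layer, v_first)
    ]
```

With a hypothetical 8-layer model, `uses_value_residual` is true only for layer indices 6 and 7.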

GQA

Used grouped-query attention: 8 query heads share 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
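With heads = 8 and kv_heads = 4, each KV head serves a group of two query heads. The sketch below shows the usual contiguous grouping; the function name is hypothetical, and the PR does not say which grouping order it uses.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Map a query head index to the KV head it shares (contiguous groups)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return query_head // group_size


# Query heads 0..7 map to KV heads [0, 0, 1, 1, 2, 2, 3, 3].
mapping = [kv_head_for(h) for h in range(8)]
```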

XSA

XSA enabled as part of the experimental architecture scaffold.

parameters: null

BiFPN2

BiFPN2 mode enabled as part of the experimental architecture scaffold.

parameters: null

MLP3x

Transformer MLP hidden-size multiplier set to 2 (mlp_mult = 2), although this entry is tagged MLP3x.

parameters: {"mlp_mult":2}

weight tying

Compact model setup built on a SentencePiece tokenizer. The PR does not state explicit weight tying, so it is not counted as a technique here.

parameters: null

Weight Averaging

EMA

parameters: null
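No EMA parameters are recorded for this PR. For orientation only, a standard exponential moving average of weights looks like the sketch below; the decay value is an assumption.

```python
def ema_update(ema, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights.

    `decay` is a hypothetical value; the PR lists no EMA parameters.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

The averaged copy, rather than the raw training weights, would typically be the one evaluated or exported.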

Quantization

late QAT

bits: null

scope: null
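Bit width and scope are not recorded, so the following is only a generic sketch of late quantization-aware training: symmetric fake quantization that is switched on after a chosen step. `bits`, `start_step`, and both function names are assumptions, not values from the PR.

```python
def fake_quantize(weights, bits=8):
    """Round weights to a symmetric integer grid and map back to floats.

    `bits` is hypothetical. In real training the rounding is usually paired
    with a straight-through gradient estimator, which is omitted here.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def qat_active(step, start_step):
    """'Late' QAT: fake quantization applies only from `start_step` onward."""
    return step >= start_step
```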

Evaluation

byte-level PPM mixture

parameters: {"order":5,"confidence_threshold":0.9,"lambda_lo":0.1,"lambda_hi":0.75}
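One plausible reading of these parameters is a confidence-gated linear mixture: when the order-5 PPM predictor's confidence reaches 0.9, its probability gets weight lambda_hi = 0.75, otherwise lambda_lo = 0.1. The sketch below assumes that reading, and assumes both probabilities refer to the same next byte. The gating direction, the confidence measure, and all function names are assumptions, not confirmed by the PR.

```python
import math


def mixture_lambda(ppm_conf, confidence_threshold=0.9,
                   lambda_lo=0.1, lambda_hi=0.75):
    """Weight on the PPM prediction (assumed: high when PPM is confident)."""
    return lambda_hi if ppm_conf >= confidence_threshold else lambda_lo


def mix_prob(p_model, p_ppm, ppm_conf):
    """Linear mixture of model and PPM probabilities for the observed byte."""
    lam = mixture_lambda(ppm_conf)
    return (1.0 - lam) * p_model + lam * p_ppm


def bits_per_byte(probs):
    """Average of -log2(p) over the probabilities assigned to each byte."""
    return -sum(math.log2(p) for p in probs) / len(probs)
```

Under this reading, the reported ppm_mix_bpb would be `bits_per_byte` computed over the mixed per-byte probabilities on the validation set.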