PR #2103

open

Add 16MB SP1024 Value Residual + PPM mixture submission ppm_mix_bpb 0.829467

val_bpb
0.8295
Architecture
Transformer
Optimizer
Artifact Size
15,806,135 bytes

Training Techniques

Architecture
Value Residual
Enabled value residual connections in the last 2 layers of the Transformer.
parameters: {"last_n_layers":2}
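As a rough illustration of what enabling value residuals in only the last 2 layers could look like, the sketch below assumes the common formulation in which a layer's value vectors are blended with the first layer's values. The function names, the blend weight `lam`, and the 9-layer depth (taken from the contributions list) are illustrative assumptions; this summary does not show the PR's actual implementation.

```python
def uses_value_residual(layer_idx, num_layers=9, last_n_layers=2):
    """Return True for the final `last_n_layers` layers (0-indexed)."""
    return layer_idx >= num_layers - last_n_layers


def mix_values(v_layer, v_first, lam=0.5):
    """Blend this layer's value vectors with the first layer's values.

    Assumed form: v = lam * v_layer + (1 - lam) * v_first.
    `lam` is a hypothetical blend weight, not a value from the PR.
    """
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_l, row_f)]
        for row_l, row_f in zip(v_layer, v_first)
    ]


# With 9 layers and last_n_layers=2, only layers 7 and 8 are selected.
active_layers = [i for i in range(9) if uses_value_residual(i)]
```

Under these assumptions, the first seven layers compute attention values as usual and only the final two blend in the first layer's values.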
GQA
Used grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
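To make the head sharing concrete, the sketch below maps each of the 8 query heads to one of the 4 KV heads and computes the resulting key/value projection width. It assumes contiguous grouping (the usual convention) and the 512-dimensional model width listed under the contributions; the helper names are hypothetical.

```python
def kv_head_for_query_head(q_head, heads=8, kv_heads=4):
    """Return the index of the KV head that a given query head reads from."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per shared KV head
    return q_head // group_size


def kv_projection_width(model_dim=512, heads=8, kv_heads=4):
    """Output width of the key (or value) projection under GQA."""
    head_dim = model_dim // heads
    return kv_heads * head_dim


head_map = [kv_head_for_query_head(h) for h in range(8)]
```

With these settings each KV head serves two query heads, so the key and value projections are half as wide as they would be with 8 KV heads.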
XSA
XSA enabled as part of the experimental architecture scaffold.
parameters: null
BiFPN2
BiFPN2 mode enabled as part of the experimental architecture scaffold.
parameters: null
MLP2x
Transformer MLP multiplier set to 2.
parameters: {"mlp_mult":2}
weight tying
No explicit weight tying was stated for this SentencePiece tokenizer-based compact model, so weight tying is not counted as a technique here.
parameters: null
Weight Averaging
EMA
parameters: null
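Since no EMA parameters are recorded, the following is only a generic sketch of exponential moving average weight tracking. The `decay` value and the flat-list weight representation are assumptions chosen for illustration, not settings from the PR.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy run: the averaged copy drifts toward the (constant) training weights.
ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.5)
```

In a typical setup the averaged copy, rather than the raw training weights, is what gets evaluated or exported.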
Quantization
late QAT
bits: null
scope: null
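The bit width and scope are not recorded, so the sketch below only illustrates the general idea of late quantization-aware training: fake quantization is switched on near the end of training. The 8-bit width, the symmetric per-tensor scheme, and the `start_fraction` schedule are all hypothetical choices, not details from the PR.

```python
def qat_enabled(step, total_steps, start_fraction=0.8):
    """Turn fake quantization on only for the final part of training."""
    return step >= int(start_fraction * total_steps)


def fake_quantize(weights, bits=8):
    """Symmetric per-tensor quantize-then-dequantize round trip."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    # Round to the nearest integer level, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


weights = [0.5, -1.0, 0.003]
quantized = fake_quantize(weights) if qat_enabled(900, 1000) else weights
```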
Evaluation
byte-level PPM mixture
parameters: {"order":5,"confidence_threshold":0.9,"lambda_lo":0.1,"lambda_hi":0.75}
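One plausible reading of these parameters is a confidence-gated linear mixture: the PPM prediction gets weight `lambda_hi` when its top probability reaches `confidence_threshold`, and `lambda_lo` otherwise. The sketch below implements that reading. The gating rule, the function names, and the assumption that both distributions are already expressed over the same next-byte alphabet are mine, not confirmed by the PR; the order-5 PPM model itself is not implemented here.

```python
import math


def mixture_weight(ppm_probs, confidence_threshold=0.9,
                   lambda_lo=0.1, lambda_hi=0.75):
    """Pick the PPM weight: high when PPM is confident, low otherwise."""
    return lambda_hi if max(ppm_probs) >= confidence_threshold else lambda_lo


def mix_distributions(model_probs, ppm_probs, **kwargs):
    """Linear mixture (1 - lam) * model + lam * PPM over the same alphabet."""
    lam = mixture_weight(ppm_probs, **kwargs)
    return [(1.0 - lam) * m + lam * p for m, p in zip(model_probs, ppm_probs)]


def bits_per_byte(probs_of_true_bytes):
    """Average negative log2 probability assigned to the observed bytes."""
    total_bits = sum(-math.log2(p) for p in probs_of_true_bytes)
    return total_bits / len(probs_of_true_bytes)


# Toy two-symbol alphabet: a confident PPM dominates, an unsure one barely moves the mix.
confident = mix_distributions([0.5, 0.5], [0.95, 0.05])
unsure = mix_distributions([0.5, 0.5], [0.6, 0.4])
```

Because both inputs sum to 1 and the weights sum to 1, the mixture remains a valid distribution, so bits per byte can be computed directly from the mixed probabilities of the observed bytes.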

Novel Contributions

  • Compact model line using the SentencePiece 1024 (SP1024) tokenizer for a 16MB submission
  • Value Residual enabled in the last 2 Transformer layers
  • Byte-level PPM mixture at evaluation time
  • Compact 9-layer 512d Transformer with 8 attention heads and 4 KV heads
  • Artifact fits under the 16MB limit
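For a sense of scale, the back-of-the-envelope estimate below counts the weight matrices of a 9-layer, 512-dimensional Transformer with 8 heads, 4 KV heads, and an MLP multiplier of 2, then compares that count with the reported artifact size. It assumes a 1024-entry vocabulary (inferred from "SP1024") and a single embedding matrix, and it ignores norms, biases, and any XSA or BiFPN2 parameters, so it is an approximation rather than the PR's actual parameter count.

```python
def estimate_params(layers=9, dim=512, heads=8, kv_heads=4,
                    mlp_mult=2, vocab=1024, tied_embeddings=True):
    """Rough weight-matrix count; omits norms, biases, and extra modules."""
    head_dim = dim // heads
    kv_dim = kv_heads * head_dim
    attention = dim * dim + 2 * dim * kv_dim + dim * dim  # Q, K, V, output
    mlp = 2 * dim * (mlp_mult * dim)                      # up and down projections
    embeddings = vocab * dim * (1 if tied_embeddings else 2)
    return layers * (attention + mlp) + embeddings


ARTIFACT_BYTES = 15_806_135  # artifact size reported in this PR summary
params = estimate_params()
bytes_per_param = ARTIFACT_BYTES / params
```

Under these assumptions the artifact works out to a little under one byte per estimated parameter, which would be consistent with quantized or compressed weights rather than 16-bit storage.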