PR #2103
open
Add 16MB SP1024 Value Residual + PPM mixture submission ppm_mix_bpb 0.829467
by lkk688
val_bpb
0.8295
Architecture
Transformer
Optimizer
—
Artifact Size
15,806,135 bytes
Training Techniques
Architecture
Value Residual
Enabled value residual connections in the last 2 layers of the Transformer.
parameters: {"last_n_layers":2}
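One common formulation of value residual blends each layer's attention values with the values computed in the first layer. The sketch below assumes that formulation, a hypothetical fixed mixing weight `lam`, and the 9-layer depth listed under Novel Contributions; the summary does not state the exact mixing rule, and both function names are illustrative.

```python
def mix_values(v_layer, v_first, lam=0.5):
    """Blend this layer's value vector with the first layer's.

    lam is a hypothetical mixing weight, not a value from the PR.
    """
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]


def uses_value_residual(layer_idx, n_layers=9, last_n_layers=2):
    """True for the final `last_n_layers` layers (0-indexed)."""
    return layer_idx >= n_layers - last_n_layers


enabled = [i for i in range(9) if uses_value_residual(i)]
print(enabled)                             # [7, 8]
print(mix_values([1.0, 2.0], [3.0, 4.0]))  # [2.0, 3.0]
```

With `last_n_layers` set to 2, only the two deepest layers would apply the blend; every earlier layer uses its own values unchanged.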
GQA
Used grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
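With 8 query heads and 4 KV heads, each key/value head is shared by two query heads. A minimal sketch of that mapping follows, assuming the usual contiguous grouping and a 512-wide model split evenly across heads (the 512d figure comes from Novel Contributions; the function name is illustrative, not taken from the PR).

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Return the KV head index a query head reads from (contiguous grouping assumed)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads sharing one KV head
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]

# Width of the key (or value) projection output under these assumptions.
d_model, heads, kv_heads = 512, 8, 4
head_dim = d_model // heads
kv_width = kv_heads * head_dim
print(kv_width)  # 256
```

Under these assumptions the key and value projections each produce 256 features instead of 512, which is where the parameter saving comes from.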
XSA
XSA enabled as part of the experimental architecture scaffold.
parameters: null
BiFPN2
BiFPN2 mode enabled as part of the experimental architecture scaffold.
parameters: null
MLP3x
Transformer MLP multiplier set to 2.
parameters: {"mlp_mult":2}
weight tying
Not included: the submission describes a compact SentencePiece tokenizer-based model setup but does not state any explicit weight tying.
parameters: null
Weight Averaging
EMA
parameters: null
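No EMA parameters are recorded here. As a reminder of what the technique does, this is a minimal sketch of an exponential moving average over model weights; the decay values are placeholders, not the submission's settings.

```python
def ema_update(ema, weights, decay=0.999):
    """Return the updated moving average of the weights (decay is hypothetical)."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# Toy run with a small decay so the effect is visible after three steps.
ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.5)
print(ema)  # [0.875, 1.75]
```

The averaged weights, rather than the final raw weights, are typically the ones evaluated or exported.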
Quantization
late QAT
bits: null
scope: null
Evaluation
byte-level PPM mixture
parameters: {"order":5,"confidence_threshold":0.9,"lambda_lo":0.1,"lambda_hi":0.75}
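These parameters suggest a confidence-gated linear mixture: when the order-5 PPM model is confident about the next byte, its prediction receives the larger weight. The sketch below assumes that gating rule (PPM's top probability compared against `confidence_threshold`) and a linear blend. The summary spells out neither, and `mix_byte_probs` and `bits_for_byte` are illustrative names.

```python
import math


def mix_byte_probs(p_model, p_ppm, confidence_threshold=0.9,
                   lambda_lo=0.1, lambda_hi=0.75):
    """Blend two next-byte distributions given as dicts of byte -> probability.

    Assumed rule: the PPM weight is lambda_hi when PPM's top probability
    reaches the threshold, otherwise lambda_lo.
    """
    lam = lambda_hi if max(p_ppm.values()) >= confidence_threshold else lambda_lo
    keys = set(p_model) | set(p_ppm)
    return {b: (1.0 - lam) * p_model.get(b, 0.0) + lam * p_ppm.get(b, 0.0)
            for b in keys}


def bits_for_byte(probs, actual_byte):
    """Code length in bits of the byte that actually occurred."""
    return -math.log2(probs[actual_byte])


p_model = {ord("a"): 0.5, ord("b"): 0.5}
p_ppm = {ord("a"): 1.0, ord("b"): 0.0}  # confident PPM: top probability >= 0.9
mixed = mix_byte_probs(p_model, p_ppm)
print(mixed[ord("a")])  # 0.875
```

Averaging `bits_for_byte` over every byte of the validation text would give a bits-per-byte figure of the kind reported as `ppm_mix_bpb`.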
Novel Contributions
- SentencePiece 1024 tokenizer compact line for a 16MB submission
- Value Residual enabled in the last 2 Transformer layers
- Byte-level PPM mixture at evaluation time
- Compact 9-layer 512d Transformer with 8 attention heads and 4 KV heads
- Artifact fits under the 16MB limit
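The size claim can be checked with simple arithmetic. The summary does not say whether "16MB" means 16,000,000 bytes or 16 MiB, so this sketch checks both readings.

```python
artifact_bytes = 15_806_135
limit_decimal = 16 * 1000 * 1000  # 16 MB, decimal reading
limit_binary = 16 * 1024 * 1024   # 16 MiB, binary reading

headroom_decimal = limit_decimal - artifact_bytes
headroom_binary = limit_binary - artifact_bytes
print(headroom_decimal)  # 193865
print(headroom_binary)   # 971081
```

The artifact fits under either reading, with roughly 194 KB to spare under the stricter decimal limit.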