PR #2091

open

Add ppm mix nonrecord submission - SP8192 + Value Residual + Byte-Level PPM Mixture val_bpb=1.16 ppm_mix_bpb=0.83

val_bpb: 1.1586
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 15973626 bytes

Training Techniques

Architecture
Value Residual
Adds value residual connections to the transformer.
parameters: null
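
To make the idea concrete, here is a minimal sketch of a value residual, assuming the common formulation in which a later layer's value vectors are blended with the values computed in the first layer. The function name `mix_values` and the mixing weight `lam` are hypothetical; this PR summary does not state how the blend is parameterized.

```python
def mix_values(v_current, v_first, lam=0.5):
    """Blend this layer's value vectors with the first layer's values.

    v_current, v_first: lists of per-token value vectors (lists of floats).
    lam: hypothetical mixing weight (assumption, not taken from the PR).
    """
    if len(v_current) != len(v_first):
        raise ValueError("value tensors must cover the same number of tokens")
    return [
        [(1.0 - lam) * c + lam * f for c, f in zip(row_c, row_f)]
        for row_c, row_f in zip(v_current, v_first)
    ]
```

With `lam=0.0` the layer uses only its own values, so the residual path can be switched off for an ablation without touching the rest of the attention code.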
XSA
Uses XSA on the last 4 layers as part of the structured skip fusion setup.
parameters: {"layers":4}
weight tying
Shares V across the last 3 layers.
parameters: {"layers":3}
U-Net skip connections
Uses structured skip fusion / BIFPN2-style skip connections.
parameters: null
BigramHash
Uses a 2-gram scaffold with fade-out.
parameters: {"n":2}
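
One way such a scaffold could work is sketched below: each (previous token, current token) pair is hashed into a bucket of an auxiliary embedding table, and the scaffold's contribution is scaled down to zero over training. The hash constant, the bucket count, and the linear fade schedule are all assumptions for illustration; the PR summary gives only `n=2` and the fact that the scaffold fades out.

```python
def bigram_bucket(prev_token, token, num_buckets=4096):
    """Map a (previous, current) token pair to a bucket index.

    The multiplicative hash and num_buckets are hypothetical choices.
    """
    return (prev_token * 1000003 + token) % num_buckets


def fade_scale(step, fade_start, fade_end):
    """Linear fade-out: 1.0 up to fade_start, 0.0 from fade_end onward."""
    if step <= fade_start:
        return 1.0
    if step >= fade_end:
        return 0.0
    return (fade_end - step) / (fade_end - fade_start)
```

In a training loop, the embedding looked up at `bigram_bucket(prev, cur)` would be multiplied by `fade_scale(step, ...)` before being added to the token embedding, so the final model no longer depends on the scaffold.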
Sequence Length
sequence_length
train_length: 8192
eval_length: null
Other
other
Scales tokenizer size and model capacity across iterative ablations.
parameters: null
other
Mixes the model's predictions with a byte-level PPM model to obtain the mixed score (ppm_mix_bpb=0.83).
parameters: null
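
The scoring side of a mixture like this can be sketched as follows, assuming both predictors supply a probability for each byte that actually occurred and that they are combined with a fixed linear weight. The function name `mixed_bpb` and the `weight` parameter are hypothetical; the PR summary does not say how the transformer's token probabilities are mapped to bytes or how the mixing weight is chosen.

```python
import math


def mixed_bpb(p_model, p_ppm, weight=0.5):
    """Average bits per byte of a linear mixture of two predictors.

    p_model, p_ppm: probability each predictor gave to the observed byte.
    weight: hypothetical fixed share given to the neural model.
    """
    if len(p_model) != len(p_ppm) or not p_model:
        raise ValueError("need two non-empty sequences of equal length")
    total_bits = 0.0
    for pm, pp in zip(p_model, p_ppm):
        p = weight * pm + (1.0 - weight) * pp
        total_bits += -math.log2(p)
    return total_bits / len(p_model)
```

A mixture can score better than either predictor alone when their errors fall on different bytes, which is the usual motivation for combining a neural model with a PPM-style context model.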

Novel Contributions

  • Incremental research process with environment-controlled switches for ablations
  • Tokenizer scaling
  • Capacity scaling
  • Value Residual
  • Byte-level PPM mixture
  • Structured skip fusion / BIFPN2 mode
  • Shared V across the last 3 layers
  • XSA on the last 4 layers
  • 2-gram scaffold with fade-out
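
The environment-controlled switches mentioned in the first bullet could look roughly like the sketch below. Every variable name here (`USE_VALUE_RESIDUAL`, `XSA_LAYERS`, `SHARED_V_LAYERS`, `USE_BIGRAM_SCAFFOLD`) is hypothetical and chosen only for illustration; the PR summary does not list the actual switches.

```python
import os


def env_flag(name, default=False, environ=None):
    """Read a boolean switch from the environment."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name, default, environ=None):
    """Read an integer setting from the environment."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    return default if raw is None else int(raw)


def load_ablation_config(environ=None):
    """Collect the (hypothetical) ablation switches into one dict."""
    return {
        "value_residual": env_flag("USE_VALUE_RESIDUAL", environ=environ),
        "xsa_layers": env_int("XSA_LAYERS", 0, environ=environ),
        "shared_v_layers": env_int("SHARED_V_LAYERS", 0, environ=environ),
        "bigram_scaffold": env_flag("USE_BIGRAM_SCAFFOLD", environ=environ),
    }
```

Keeping every feature behind a switch with an "off" default means each ablation run differs from the baseline only in the variables set on the command line, which makes incremental comparisons easier to reproduce.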