PR #1985: auto golf
Status: open
Author: yigengjiang

val_bpb: 1.1093
Architecture: Transformer
Artifact Size: 15,849,959 bytes

Training Techniques

Architecture
depth recurrence
Delayed mini depth recurrence applied on layers 4 and 5.
parameters: {"layers":[4,5]}
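One plausible reading of "delayed mini depth recurrence" is that layers 4 and 5 are applied a second time after the main pass, reusing their weights. A toy sketch under that assumption (hypothetical, not the PR's code), with each layer reduced to a plain function on a scalar stand-in for the hidden state:

```python
def forward(x, layers, recur=(4, 5)):
    """Run all layers once, then re-apply the recurrent layers a
    second, delayed time (one plausible reading of the PR entry)."""
    for f in layers:
        x = f(x)
    for i in recur:  # delayed second application of layers 4 and 5
        x = layers[i](x)
    return x
```

With eight identity-plus-one layers, the two recurrent layers add two extra applications on top of the eight from the main pass.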
weight tying
Untied repeated MLPs are used.
parameters: null
parallel residuals
Parallel residual routing from layer 7.
parameters: {"start_layer":7}
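"Parallel residual routing" plausibly refers to the GPT-J-style block form in which attention and MLP both read the same input, rather than the MLP reading the attention output. A minimal sketch of that reading (scalar stand-ins, not the PR's code), switching to the parallel form from `start_layer` onward:

```python
def block(x, attn, mlp, parallel):
    """One block: parallel form adds attn(x) and mlp(x) to the same
    input; sequential form feeds the attention output into the MLP."""
    if parallel:
        return x + attn(x) + mlp(x)
    h = x + attn(x)
    return h + mlp(h)

def forward(x, blocks, start_layer=7):
    # route blocks from `start_layer` onward through the parallel form
    for i, (attn, mlp) in enumerate(blocks):
        x = block(x, attn, mlp, parallel=(i >= start_layer))
    return x
```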
XSA
XSA applied on the last 11 layers.
parameters: {"layers":11}
Quantization
mixed int6/int5
bits: null
scope: block weights
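Per the contributions list below, the recurrent layers are forced to int6 while the rest use int5. A minimal symmetric per-tensor quantization sketch illustrating that split (an assumption-laden toy, not the PR's quantizer; the layer indices come from the depth-recurrence entry above):

```python
def quantize(weights, bits):
    """Symmetric quantization of a weight list to a signed
    `bits`-bit grid; returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def bits_for_layer(layer_idx, recurrent_layers=(4, 5)):
    # recurrent layers are forced to int6; everything else gets int5
    return 6 if layer_idx in recurrent_layers else 5

w = [-1.0, 0.25, 1.0]
q6, s6 = quantize(w, bits_for_layer(4))  # int6: qmax = 31
q5, s5 = quantize(w, bits_for_layer(0))  # int5: qmax = 15
```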
GPTQ
bits: null
scope: all
Compression
Brotli
level: null
Other
byte shuffling
Byte shuffling applied before compression.
parameters: null
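Byte shuffling regroups the bytes of fixed-width elements so that same-significance bytes (which vary slowly across neighboring weights) sit next to each other, giving the compressor longer matches. A stdlib-only sketch of the idea; zlib stands in for Brotli here, since Brotli needs the third-party `brotli` package:

```python
import zlib

def byte_shuffle(data: bytes, width: int) -> bytes:
    """Group byte 0 of every `width`-byte element, then byte 1, etc."""
    assert len(data) % width == 0
    return b"".join(data[i::width] for i in range(width))

def byte_unshuffle(data: bytes, width: int) -> bytes:
    """Inverse of byte_shuffle."""
    n = len(data) // width
    out = bytearray(len(data))
    for i in range(width):
        out[i::width] = data[i * n:(i + 1) * n]
    return bytes(out)

payload = bytes(range(16)) * 256           # toy "weights", 2-byte elements
shuffled = byte_shuffle(payload, width=2)
assert byte_unshuffle(shuffled, 2) == payload
# compare compressed sizes of the two layouts (zlib as a stand-in)
print(len(zlib.compress(shuffled)), len(zlib.compress(payload)))
```

The shuffle is lossless and width-parameterized, so decompression simply applies `byte_unshuffle` after the Brotli decode.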
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Delayed mini depth recurrence on layers 4 and 5
  • Untied repeated MLPs
  • Parallel residual routing from layer 7
  • XSA on the last 11 layers
  • GPTQ calibration on autoregressively self-generated data
  • Mixed int6/int5 quantization with recurrent layers forced int6
  • Brotli compression with byte shuffling