PR #534

closed

Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)

val_bpb: 1.1804
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.95 MB

Training Techniques

Architecture
Partial RoPE
Applies RoPE to only 16 of the 64 head dimensions, leaving the rest position-free.
parameters: {"dimensions":16,"total_dimensions":64}
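A minimal sketch of partial RoPE under the stated config (rotate 16 of 64 head dimensions, pass the rest through untouched). The function name and tensor layout are assumptions, not the PR's actual code:

```python
import torch

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rope_dims` of each
    head dimension; the remaining dims stay position-free.
    x: (batch, seq, heads, head_dim), head_dim=64 per the PR config."""
    b, t, h, d = x.shape
    rot, rest = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)
```

Note that at position 0 the rotation angle is zero, so the rotated slice equals the input there; only the first 16 dims ever change.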
XSA
GQA-aware self-value debiasing applied to the last 4 layers.
parameters: {"layers":4}
VE128
Shared value embedding injection across selected layers.
parameters: {"layers":[9,10]}
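A hedged sketch of what VE128 could look like: one shared 128-dim token embedding table, projected up and added to the attention values at layers 9 and 10. The class name, per-layer projection, and injection point are assumptions; only the 128-dim table, the sharing, and the layer list come from the PR:

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """Hypothetical VE128 sketch: a single 128-dim embedding table shared
    across the selected layers, mixed into each layer's attention values."""
    def __init__(self, vocab_size, d_model, ve_dim=128, layers=(9, 10)):
        super().__init__()
        self.layers = set(layers)
        self.embed = nn.Embedding(vocab_size, ve_dim)  # shared table
        self.proj = nn.ModuleDict({str(i): nn.Linear(ve_dim, d_model, bias=False)
                                   for i in layers})   # per-layer mix-in (assumed)

    def forward(self, tokens, layer_idx, v):
        # v: (batch, seq, d_model) attention values for this layer
        if layer_idx not in self.layers:
            return v
        return v + self.proj[str(layer_idx)](self.embed(tokens))
```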
MLP width
Reduced MLP hidden size to 1408 to fit within the artifact budget and allow more training steps.
parameters: {"hidden_size":1408}
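The budget effect of the width cut can be checked with simple parameter counting. The model width below is purely illustrative (the PR does not state d_model); only the 1536 to 1408 hidden sizes come from the source:

```python
def mlp_params(d_model, hidden):
    """Parameter count of a standard 2-layer MLP (in-proj + out-proj, no bias)."""
    return 2 * d_model * hidden

# d_model=768 is an assumed, illustrative width.
saved_per_layer = mlp_params(768, 1536) - mlp_params(768, 1408)  # 196,608 params
```

At any width, the saving per layer is 2 * d_model * 128; summed over the 11 layers it is what frees room under the 16MB artifact cap.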
Regularization
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(i+1)"}
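The stated scale rule is direct to implement. A minimal sketch, assuming the scale multiplies each layer's LayerNorm gain (the exact application point is not specified in the PR):

```python
import math

def ln_scale(layer_idx):
    """Layerwise LayerNorm scale per the stated rule 1/sqrt(i+1),
    damping the residual contribution of deeper layers."""
    return 1.0 / math.sqrt(layer_idx + 1)
```

Layer 0 keeps a unit scale; layer 3 is scaled by 0.5, layer 10 by about 0.30.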
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"scale_threshold":0.2}
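A minimal running-average sketch of stochastic weight averaging. The PR's "tight" variant and its scale_threshold=0.2 selection rule are not specified, so this sketch simply averages every checkpoint it is fed:

```python
import torch

class WeightAverager:
    """Maintain a running mean of model weights across checkpoints."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, state_dict):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.clone().float() for k, v in state_dict.items()}
        else:
            for k, v in state_dict.items():
                # incremental mean: avg += (x - avg) / n
                self.avg[k] += (v.float() - self.avg[k]) / self.n
```

The incremental-mean form avoids storing all checkpoints; the averaged weights are typically loaded back into the model at the end of training.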
Quantization
QAT (quantization-aware training)
bits: null
scope: late training
GPTQ-lite
bits: null
scope: all
mixed int6/int8
bits: 6
scope: layers 1-9 int6, layers 0 and 10 int8
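A hedged sketch of the mixed-precision scheme: symmetric per-tensor quantization at 6 or 8 bits depending on the layer, per the stated split. This does not reproduce the PR's actual GPTQ-lite calibration, only the bit assignment and a plain round-to-nearest quantizer:

```python
import torch

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization; returns (dequantized tensor,
    integer codes). int8 storage also covers the int6 code range."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.to(torch.int8)

def bits_for_layer(i):
    """Per the PR: layers 1-9 at int6, layers 0 and 10 at int8."""
    return 8 if i in (0, 10) else 6
```

Keeping the first and last layers at int8 is a common choice because embedding-adjacent layers tend to be the most quantization-sensitive.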
Compression
zstd
level: 22

Novel Contributions

  • MLP hidden size reduced from 1536 to 1408 to fit under the 16MB limit.
  • Narrower MLP enabled 33% more training steps within the same time budget.
  • Combination of frontier techniques from prior PRs with GPTQ-lite quantization.