PR #534

closed

Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804)

val_bpb: 1.1804
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.95 MB

Training Techniques

Architecture
Partial RoPE
Applies RoPE to only 16 of the 64 head dimensions, leaving the rest position-free.
parameters: {"dimensions":16,"total_dimensions":64}
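A minimal sketch of partial RoPE under the stated config (rotate 16 of 64 head dimensions, pass the rest through untouched). The function name and tensor layout are assumptions, not the PR's actual code:

```python
import torch

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rope_dims` of each
    head dimension; the remaining dims stay position-free.
    x: (batch, seq, heads, head_dim), head_dim=64 per the PR config."""
    b, t, h, d = x.shape
    rot, rest = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, rest], dim=-1)
```

Note that at position 0 the rotation angle is zero, so the rotated slice equals the input there; only the first 16 dims ever change.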
XSA
GQA-aware self-value debiasing applied to the last 4 layers.
parameters: {"layers":4}
VE128
Shared value embedding injection across selected layers.
parameters: {"layers":[9,10]}
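A hedged sketch of what VE128 could look like: one shared 128-dim token embedding table, projected up and added to the attention values at layers 9 and 10. The class name, per-layer projection, and injection point are assumptions; only the 128-dim table, the sharing, and the layer list come from the PR:

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """Hypothetical VE128 sketch: a single 128-dim embedding table shared
    across the selected layers, mixed into each layer's attention values."""
    def __init__(self, vocab_size, d_model, ve_dim=128, layers=(9, 10)):
        super().__init__()
        self.layers = set(layers)
        self.embed = nn.Embedding(vocab_size, ve_dim)  # shared table
        self.proj = nn.ModuleDict({str(i): nn.Linear(ve_dim, d_model, bias=False)
                                   for i in layers})   # per-layer mix-in (assumed)

    def forward(self, tokens, layer_idx, v):
        # v: (batch, seq, d_model) attention values for this layer
        if layer_idx not in self.layers:
            return v
        return v + self.proj[str(layer_idx)](self.embed(tokens))
```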
MLP width
Reduced MLP hidden size to 1408 to fit within the artifact budget and allow more training steps.
parameters: {"hidden_size":1408}
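The budget effect of the width cut can be checked with simple parameter counting. The model width below is purely illustrative (the PR does not state d_model); only the 1536 to 1408 hidden sizes come from the source:

```python
def mlp_params(d_model, hidden):
    """Parameter count of a standard 2-layer MLP (in-proj + out-proj, no bias)."""
    return 2 * d_model * hidden

# d_model=768 is an assumed, illustrative width.
saved_per_layer = mlp_params(768, 1536) - mlp_params(768, 1408)  # 196,608 params
```

At any width, the saving per layer is 2 * d_model * 128; summed over the 11 layers it is what frees room under the 16MB artifact cap.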
Regularization
layerwise LN scale
parameters: {"scale_rule":"1/sqrt(i+1)"}
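The stated scale rule is direct to implement. A minimal sketch, assuming the scale multiplies each layer's LayerNorm gain (the exact application point is not specified in the PR):

```python
import math

def ln_scale(layer_idx):
    """Layerwise LayerNorm scale per the stated rule 1/sqrt(i+1),
    damping the residual contribution of deeper layers."""
    return 1.0 / math.sqrt(layer_idx + 1)
```

Layer 0 keeps a unit scale; layer 3 is scaled by 0.5, layer 10 by about 0.30.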
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"scale_threshold":0.2}
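A minimal running-average sketch of stochastic weight averaging. The PR's "tight" variant and its scale_threshold=0.2 selection rule are not specified, so this sketch simply averages every checkpoint it is fed:

```python
import torch

class WeightAverager:
    """Maintain a running mean of model weights across checkpoints."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, state_dict):
        self.n += 1
        if self.avg is None:
            self.avg = {k: v.clone().float() for k, v in state_dict.items()}
        else:
            for k, v in state_dict.items():
                # incremental mean: avg += (x - avg) / n
                self.avg[k] += (v.float() - self.avg[k]) / self.n
```

The incremental-mean form avoids storing all checkpoints; the averaged weights are typically loaded back into the model at the end of training.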
Quantization
QAT (quantization-aware training)
bits: null
scope: late training
GPTQ-lite
bits: null
scope: all
mixed int6/int8
bits: 6
scope: layers 1-9 int6, layers 0 and 10 int8
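A hedged sketch of the mixed-precision scheme: symmetric per-tensor quantization at 6 or 8 bits depending on the layer, per the stated split. This does not reproduce the PR's actual GPTQ-lite calibration, only the bit assignment and a plain round-to-nearest quantizer:

```python
import torch

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization; returns (dequantized tensor,
    integer codes). int8 storage also covers the int6 code range."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.to(torch.int8)

def bits_for_layer(i):
    """Per the PR: layers 1-9 at int6, layers 0 and 10 at int8."""
    return 8 if i in (0, 10) else 6
```

Keeping the first and last layers at int8 is a common choice because embedding-adjacent layers tend to be the most quantization-sensitive.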
Compression
zstd
level: 22

Novel Contributions

  • MLP hidden size reduced from 1536 to 1408 to fit under the 16MB limit.
  • Narrower MLP enabled 33% more training steps within the same time budget.
  • Combination of frontier techniques from prior PRs with GPTQ-lite quantization.