PR #1818

open

Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100)

by taka6745
val_bpb
1.1009
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,696,362 bytes

Training Techniques

Architecture
GQA
Grouped-query attention in the transformer stack.
parameters: null
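A minimal sketch of the grouped-query head mapping (the head counts here are illustrative; the PR does not record them):

```python
def gqa_kv_head(q_head, n_q_heads=8, n_kv_heads=2):
    """Map a query head to its shared key/value head under GQA.

    Query heads are partitioned into n_kv_heads groups; all heads in a
    group attend through the same K/V projection, shrinking the KV
    cache by a factor of n_q_heads / n_kv_heads.
    """
    assert n_q_heads % n_kv_heads == 0
    return q_head // (n_q_heads // n_kv_heads)
```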
Partial RoPE
Rotary embeddings applied to only part of the head dimensions.
parameters: {"dimensions":16}
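A sketch of partial RoPE over the first 16 head dimensions, matching the recorded `dimensions: 16`; the function name and frequency base are illustrative assumptions:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` entries of a head vector.

    x: one head's query/key vector at token position `pos`.
    Dimensions beyond rot_dims pass through unchanged (partial RoPE).
    """
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```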
XSA
Extended sparse attention with sliding-window plus global tokens on later layers.
parameters: {"layers":4}
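The sliding-window-plus-global-tokens pattern can be sketched as a boolean attention mask; the window size and global-token count below are assumptions, since the PR records only `layers: 4`:

```python
def xsa_mask(n, window=128, n_global=4):
    """Causal mask: each token sees a sliding window of recent tokens
    plus the first n_global tokens, which every position may attend to.
    Returns allowed[i][j] = True if query i may attend to key j.
    """
    return [[(j <= i) and (i - j < window or j < n_global)
             for j in range(n)] for i in range(n)]
```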
U-Net skip connections
Asymmetric encoder-decoder skip connections with learned skip weights.
parameters: null
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"squared":true}
weight tying
Token embedding and LM head are tied.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_iterations":3}
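Muon's distinguishing step is orthogonalizing the momentum/gradient matrix with a few Newton-Schulz iterations. This sketch uses the simple cubic iteration (Muon itself uses a tuned quintic polynomial) with the run's 3 iterations:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=3):
    """Approximately orthogonalize a matrix via Newton-Schulz iteration.

    Normalizing by the Frobenius norm puts singular values in a range
    where the iteration converges; each step pushes them toward 1.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```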
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings_and_scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
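The EMA update per step, with the run's recorded decay of 0.9965 (the averaged weights would replace the raw ones at eval time):

```python
def ema_update(ema_params, model_params, decay=0.9965):
    """One EMA step: ema <- decay * ema + (1 - decay) * model."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, model_params)]
```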
Evaluation
sliding window eval
parameters: {"stride":64}
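Sliding-window evaluation scores each token with near-full left context by advancing a fixed window in small strides and scoring only the trailing tokens of each window. The span layout below is an assumption; the PR records only `stride: 64`:

```python
def sliding_eval_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each window is scored only on its trailing `stride` tokens, at the
    cost of roughly window/stride forward passes per window of text.
    """
    spans = []
    start = 0
    while start + window <= n_tokens:
        spans.append((start, start + window, start + window - stride))
        start += stride
    return spans
```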
Test-Time Training
score-first TTT
parameters: {"learning_rate_schedule":"cosine decay"}
LR Schedule
cosine decay
parameters: {"start":0.0001,"end":0.000001}
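The cosine decay schedule with the recorded endpoints (1e-4 down to 1e-6):

```python
import math

def cosine_lr(step, total_steps, lr_start=1e-4, lr_end=1e-6):
    """Cosine decay from lr_start at step 0 to lr_end at total_steps."""
    t = min(step / total_steps, 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * t))
```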
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Quantization
GPTQ
bits: 6
scope: matrix weights
mixed int6/int5
bits: null
scope: weights and embeddings
GPTQ
bits: 6
scope: calibration and post-training quantization
Regularization
layerwise LN scale
parameters: null
Other
other
Entropy-bucket curriculum sampler with wallclock-driven easy-to-hard crossfade and floor weight.
parameters: {"floor":0.02,"buckets":8}
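One way to sketch the entropy-bucket curriculum: sampling weights over buckets crossfade from easy to hard as training progress (here a 0-to-1 wallclock fraction) advances, with the recorded floor of 0.02 keeping every bucket in play. The triangular crossfade shape is an assumption; the PR records only the floor and bucket count:

```python
def bucket_weights(progress, n_buckets=8, floor=0.02):
    """Normalized sampling weights over entropy buckets (0 = easiest).

    Mass is centered on the bucket whose hardness matches `progress`,
    and `floor` guarantees a minimum pre-normalization weight per bucket.
    """
    raw = []
    for b in range(n_buckets):
        hardness = b / (n_buckets - 1)              # 0 easy .. 1 hard
        w = max(0.0, 1.0 - abs(hardness - progress))
        raw.append(max(w, floor))
    total = sum(raw)
    return [w / total for w in raw]
```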
other
Freeze-dry linear-reconstruction storage filter that drops weights predictable from row neighbors.
parameters: {"neighbors":2}
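The freeze-dry idea, sketched per row with the recorded `neighbors: 2`: weights whose value can be reconstructed from nearby row entries are dropped from storage and rebuilt at load time. The mean-of-neighbors predictor and tolerance are illustrative assumptions:

```python
def freeze_dry_row(row, neighbors=2, tol=1e-3):
    """Return keep-flags for one weight row.

    Each weight is predicted as the mean of up to `neighbors` entries
    on each side; if the reconstruction error is within `tol`, the
    weight need not be stored explicitly (flag False).
    """
    keep = []
    for i, w in enumerate(row):
        lo, hi = max(0, i - neighbors), min(len(row), i + neighbors + 1)
        ctx = [row[j] for j in range(lo, hi) if j != i]
        pred = sum(ctx) / len(ctx)
        keep.append(abs(w - pred) > tol)
    return keep
```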
other
2:4 sparsity packing for storage-only compression using 3-bit values and position codes.
parameters: {"block_size":4,"survivors_per_block":2}
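The 2:4 storage packing, sketched per block with the recorded block_size=4 and survivors_per_block=2: keep the two largest-magnitude weights, store their positions as a compact code and their values as 3-bit signed levels. The per-block scale handling is an illustrative assumption:

```python
def pack_2_of_4(block):
    """Pack a block of 4 weights as (positions, 3-bit levels).

    positions lists the two surviving indices (a 2-of-4 code fits in a
    few bits); levels are signed integers clamped to [-4, 3].
    """
    assert len(block) == 4
    positions = sorted(sorted(range(4), key=lambda i: -abs(block[i]))[:2])
    scale = max(abs(block[i]) for i in positions) or 1.0
    levels = [max(-4, min(3, round(block[i] / scale * 3)))
              for i in positions]
    return positions, levels
```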
Compression
zstd
level: 22

Novel Contributions

  • Post-quantization damage gap: a large, reproducible degradation from GPTQ int6 despite strong pre-quant performance.
  • Entropy-bucket curriculum sampler with wallclock-driven easy-to-hard sampling and floor weight.
  • Freeze-dry: local linear-reconstruction-based storage filtering using neighboring weights.
  • 2:4 sparsity packing for storage-side compression with 3-bit values and compact position codes.
  • Observation that TTT partially recovers the quantization-induced loss but does not close the gap.