PR #1818

open

Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100)

by taka6745
val_bpb
1.1009
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,696,362 bytes

Training Techniques

Architecture
GQA
Grouped-query attention in the transformer stack.
parameters: null
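A minimal sketch of the grouped-query head mapping (the head counts here are illustrative; the PR does not record them):

```python
def gqa_kv_head(q_head, n_q_heads=8, n_kv_heads=2):
    """Map a query head to its shared key/value head under GQA.

    Query heads are partitioned into n_kv_heads groups; all heads in a
    group attend through the same K/V projection, shrinking the KV
    cache by a factor of n_q_heads / n_kv_heads.
    """
    assert n_q_heads % n_kv_heads == 0
    return q_head // (n_q_heads // n_kv_heads)
```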
Partial RoPE
Rotary embeddings applied to only part of the head dimensions.
parameters: {"dimensions":16}
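A sketch of partial RoPE over the first 16 head dimensions, matching the recorded `dimensions: 16`; the function name and frequency base are illustrative assumptions:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` entries of a head vector.

    x: one head's query/key vector at token position `pos`.
    Dimensions beyond rot_dims pass through unchanged (partial RoPE).
    """
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```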
XSA
Extended sparse attention with sliding-window plus global tokens on later layers.
parameters: {"layers":4}
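The sliding-window-plus-global-tokens pattern can be sketched as a boolean attention mask; the window size and global-token count below are assumptions, since the PR records only `layers: 4`:

```python
def xsa_mask(n, window=128, n_global=4):
    """Causal mask: each token sees a sliding window of recent tokens
    plus the first n_global tokens, which every position may attend to.
    Returns allowed[i][j] = True if query i may attend to key j.
    """
    return [[(j <= i) and (i - j < window or j < n_global)
             for j in range(n)] for i in range(n)]
```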
U-Net skip connections
Asymmetric encoder-decoder skip connections with learned skip weights.
parameters: null
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"squared":true}
weight tying
Token embedding and LM head are tied.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_iterations":3}
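Muon's distinguishing step is orthogonalizing the momentum/gradient matrix with a few Newton-Schulz iterations. This sketch uses the simple cubic iteration (Muon itself uses a tuned quintic polynomial) with the run's 3 iterations:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=3):
    """Approximately orthogonalize a matrix via Newton-Schulz iteration.

    Normalizing by the Frobenius norm puts singular values in a range
    where the iteration converges; each step pushes them toward 1.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```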
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings_and_scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
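The EMA update per step, with the run's recorded decay of 0.9965 (the averaged weights would replace the raw ones at eval time):

```python
def ema_update(ema_params, model_params, decay=0.9965):
    """One EMA step: ema <- decay * ema + (1 - decay) * model."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, model_params)]
```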
Evaluation
sliding window eval
parameters: {"stride":64}
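Sliding-window evaluation scores each token with near-full left context by advancing a fixed window in small strides and scoring only the trailing tokens of each window. The span layout below is an assumption; the PR records only `stride: 64`:

```python
def sliding_eval_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each window is scored only on its trailing `stride` tokens, at the
    cost of roughly window/stride forward passes per window of text.
    """
    spans = []
    start = 0
    while start + window <= n_tokens:
        spans.append((start, start + window, start + window - stride))
        start += stride
    return spans
```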
Test-Time Training
score-first TTT
parameters: {"learning_rate_schedule":"cosine decay"}
LR Schedule
cosine decay
parameters: {"start":0.0001,"end":0.000001}
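The cosine decay schedule with the recorded endpoints (1e-4 down to 1e-6):

```python
import math

def cosine_lr(step, total_steps, lr_start=1e-4, lr_end=1e-6):
    """Cosine decay from lr_start at step 0 to lr_end at total_steps."""
    t = min(step / total_steps, 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * t))
```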
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Quantization
GPTQ
bits: 6
scope: matrix weights
mixed int6/int5
bits: null
scope: weights and embeddings
GPTQ
bits: 6
scope: calibration and post-training quantization
Regularization
layerwise LN scale
parameters: null
Other
other
Entropy-bucket curriculum sampler with wallclock-driven easy-to-hard crossfade and floor weight.
parameters: {"floor":0.02,"buckets":8}
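One way to sketch the entropy-bucket curriculum: sampling weights over buckets crossfade from easy to hard as training progress (here a 0-to-1 wallclock fraction) advances, with the recorded floor of 0.02 keeping every bucket in play. The triangular crossfade shape is an assumption; the PR records only the floor and bucket count:

```python
def bucket_weights(progress, n_buckets=8, floor=0.02):
    """Normalized sampling weights over entropy buckets (0 = easiest).

    Mass is centered on the bucket whose hardness matches `progress`,
    and `floor` guarantees a minimum pre-normalization weight per bucket.
    """
    raw = []
    for b in range(n_buckets):
        hardness = b / (n_buckets - 1)              # 0 easy .. 1 hard
        w = max(0.0, 1.0 - abs(hardness - progress))
        raw.append(max(w, floor))
    total = sum(raw)
    return [w / total for w in raw]
```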
other
Freeze-dry linear-reconstruction storage filter that drops weights predictable from row neighbors.
parameters: {"neighbors":2}
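The freeze-dry idea, sketched per row with the recorded `neighbors: 2`: weights whose value can be reconstructed from nearby row entries are dropped from storage and rebuilt at load time. The mean-of-neighbors predictor and tolerance are illustrative assumptions:

```python
def freeze_dry_row(row, neighbors=2, tol=1e-3):
    """Return keep-flags for one weight row.

    Each weight is predicted as the mean of up to `neighbors` entries
    on each side; if the reconstruction error is within `tol`, the
    weight need not be stored explicitly (flag False).
    """
    keep = []
    for i, w in enumerate(row):
        lo, hi = max(0, i - neighbors), min(len(row), i + neighbors + 1)
        ctx = [row[j] for j in range(lo, hi) if j != i]
        pred = sum(ctx) / len(ctx)
        keep.append(abs(w - pred) > tol)
    return keep
```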
other
2:4 sparsity packing for storage-only compression using 3-bit values and position codes.
parameters: {"block_size":4,"survivors_per_block":2}
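The 2:4 storage packing, sketched per block with the recorded block_size=4 and survivors_per_block=2: keep the two largest-magnitude weights, store their positions as a compact code and their values as 3-bit signed levels. The per-block scale handling is an illustrative assumption:

```python
def pack_2_of_4(block):
    """Pack a block of 4 weights as (positions, 3-bit levels).

    positions lists the two surviving indices (a 2-of-4 code fits in a
    few bits); levels are signed integers clamped to [-4, 3].
    """
    assert len(block) == 4
    positions = sorted(sorted(range(4), key=lambda i: -abs(block[i]))[:2])
    scale = max(abs(block[i]) for i in positions) or 1.0
    levels = [max(-4, min(3, round(block[i] / scale * 3)))
              for i in positions]
    return positions, levels
```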
Compression
zstd
level: 22

Novel Contributions

  • Post-quantization damage gap: a large, reproducible degradation from GPTQ int6 despite strong pre-quant performance.
  • Entropy-bucket curriculum sampler with wallclock-driven easy-to-hard sampling and floor weight.
  • Freeze-dry: local linear-reconstruction-based storage filtering using neighboring weights.
  • 2:4 sparsity packing for storage-side compression with 3-bit values and compact position codes.
  • Observation that TTT partially recovers the quantization-induced loss but does not close the gap.