PR #1048

open

Non-record: Compression moonshots — 8 negative/marginal findings (Procrustes, SWA smoothness, selective fp16, pruning+zstd)

by mrdavtan
val_bpb: 1.1724
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,175,136 bytes

Training Techniques

Quantization
  int6 (bits: 6; scope: all weights)
  GPTQ-lite (bits: 6; scope: well-conditioned weights)
  fp16 (bits: 16; scope: selected embedding rows)
Architecture
  MLP3x: expanded MLP hidden size to 3x baseline (1536 vs 1024); parameters: {"hidden":1536}
  weight tying: tied input and output embeddings
Weight Averaging
  SWA
Compression
  zstd (level: 22)
Regularization
  magnitude pruning (parameters: {"sparsity":0.03})
Optimizer
  Muon (weight_decay: none; momentum: 0.99; other_params: {"backend_steps":5})
LR Schedule
  warmdown (parameters: {"warmdown_iters":20000})
Evaluation
  sliding window eval (parameters: {"stride":64})
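The int6 per-row quantization listed above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the helper names, the symmetric round-to-nearest rule, and the one-scale-per-row layout are all assumptions.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6 quantization onto integer levels in [-31, 31].

    Hypothetical sketch: one floating-point scale per row, round-to-nearest.
    """
    # Map each row's max magnitude onto the int6 range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # keep all-zero rows well defined
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip a small weight matrix; per-element error is at most scale/2 per row.
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize_int6(q, s)
```

The 6-bit codes are stored in int8 containers here; an actual exporter would bit-pack them (and zstd the result, per the compression entry above) to realize the size saving.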

Novel Contributions

  • Int6 per-row quantization with 3x MLP expansion achieved the best reported validation score.
  • Systematic ablation study of nine techniques, including SWA, doc-isolated evaluation, curriculum learning, multi-token prediction, SmearGate + BigramHash, depth recurrence, and int8 QAT.
  • Checkpoint-analysis findings showing Procrustes rotational structure across layers and across seeds, but with no artifact-size benefit.
  • Selective fp16 embedding export based on embedding entropy to reduce artifact size.
  • Observation that small amounts of magnitude pruning can increase compressed artifact size due to interaction with zstd.
  • Identification of a block-7 quantization outlier with unusually high kurtosis, suggesting selective fp16 protection.
  • Finding that SWA produces smoother weights and smaller artifacts than EMA, but only if step count is preserved.
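The entropy-based fp16 embedding-row selection mentioned above can be sketched like this. Illustrative only: the entropy definition (Shannon entropy over a row's normalized absolute values) and the `keep_frac` parameter are assumptions, not the PR's exact criterion.

```python
import numpy as np

def select_fp16_rows(emb: np.ndarray, keep_frac: float = 0.05) -> np.ndarray:
    """Flag the highest-entropy embedding rows for fp16 export.

    Hypothetical criterion: rows whose normalized |values| are closest to
    uniform carry the most spread-out information and quantize worst, so
    they are kept at full fp16 precision.
    """
    p = np.abs(emb) + 1e-12
    p = p / p.sum(axis=1, keepdims=True)   # per-row probability distribution
    ent = -(p * np.log(p)).sum(axis=1)     # Shannon entropy per row
    k = max(1, int(keep_frac * emb.shape[0]))
    mask = np.zeros(emb.shape[0], dtype=bool)
    mask[np.argsort(ent)[-k:]] = True      # top-k entropy rows -> fp16
    return mask
```

The remaining (low-entropy) rows would go through the int6 path, matching the "selected embedding rows" scope in the quantization list.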
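The pruning/zstd size interaction is easiest to check empirically. A minimal harness under stated assumptions: `prune_smallest` is a hypothetical global magnitude-pruning helper, and stdlib zlib stands in for zstd level 22 so the sketch runs without third-party compressors; the direction of the size change has to be measured, not assumed.

```python
import zlib
import numpy as np

def compressed_size(w: np.ndarray, level: int = 9) -> int:
    # zlib stands in for zstd here; the PR uses zstd at level 22.
    return len(zlib.compress(w.tobytes(), level))

def prune_smallest(w: np.ndarray, sparsity: float = 0.03) -> np.ndarray:
    """Zero the smallest-magnitude fraction of weights (global threshold)."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k)[k]
    out = w.copy()
    out[np.abs(out) < thresh] = 0.0
    return out
```

Zeroing 3% of fp32 weights replaces scattered 4-byte values with zeros, which can perturb the compressor's match finding enough to grow the output, consistent with the negative finding above.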