PR #1048

open

Non-record: Compression moonshots — 8 negative/marginal findings (Procrustes, SWA smoothness, selective fp16, pruning+zstd)

by mrdavtan
val_bpb: 1.1724
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,175,136 bytes

Training Techniques

Quantization
  int6 (bits: 6; scope: all weights)
  GPTQ-lite (bits: 6; scope: well-conditioned weights)
  fp16 (bits: 16; scope: selected embedding rows)
Architecture
  MLP3x: expanded MLP hidden size to 3x baseline (1536 vs 1024); parameters: {"hidden":1536}
  weight tying: tied input and output embeddings
Weight Averaging
  SWA
Compression
  zstd (level: 22)
Regularization
  magnitude pruning (parameters: {"sparsity":0.03})
Optimizer
  Muon (weight_decay: none; momentum: 0.99; other_params: {"backend_steps":5})
LR Schedule
  warmdown (parameters: {"warmdown_iters":20000})
Evaluation
  sliding window eval (parameters: {"stride":64})
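The int6 per-row quantization listed above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the helper names, the symmetric round-to-nearest rule, and the one-scale-per-row layout are all assumptions.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6 quantization onto integer levels in [-31, 31].

    Hypothetical sketch: one floating-point scale per row, round-to-nearest.
    """
    # Map each row's max magnitude onto the int6 range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0  # keep all-zero rows well defined
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip a small weight matrix; per-element error is at most scale/2 per row.
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize_int6(q, s)
```

The 6-bit codes are stored in int8 containers here; an actual exporter would bit-pack them (and zstd the result, per the compression entry above) to realize the size saving.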

Novel Contributions

  • Int6 per-row quantization with 3x MLP expansion achieved the best reported validation score.
  • Systematic ablation study of nine techniques, including SWA, doc-isolated evaluation, curriculum learning, multi-token prediction, SmearGate + BigramHash, depth recurrence, and int8 QAT.
  • Checkpoint-analysis findings showing Procrustes rotational structure across layers and across seeds, but with no artifact-size benefit.
  • Selective fp16 embedding export based on embedding entropy to reduce artifact size.
  • Observation that small amounts of magnitude pruning can increase compressed artifact size due to interaction with zstd.
  • Identification of a block-7 quantization outlier with unusually high kurtosis, suggesting selective fp16 protection.
  • Finding that SWA produces smoother weights and smaller artifacts than EMA, but only if step count is preserved.
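The entropy-based fp16 embedding-row selection mentioned above can be sketched like this. Illustrative only: the entropy definition (Shannon entropy over a row's normalized absolute values) and the `keep_frac` parameter are assumptions, not the PR's exact criterion.

```python
import numpy as np

def select_fp16_rows(emb: np.ndarray, keep_frac: float = 0.05) -> np.ndarray:
    """Flag the highest-entropy embedding rows for fp16 export.

    Hypothetical criterion: rows whose normalized |values| are closest to
    uniform carry the most spread-out information and quantize worst, so
    they are kept at full fp16 precision.
    """
    p = np.abs(emb) + 1e-12
    p = p / p.sum(axis=1, keepdims=True)   # per-row probability distribution
    ent = -(p * np.log(p)).sum(axis=1)     # Shannon entropy per row
    k = max(1, int(keep_frac * emb.shape[0]))
    mask = np.zeros(emb.shape[0], dtype=bool)
    mask[np.argsort(ent)[-k:]] = True      # top-k entropy rows -> fp16
    return mask
```

The remaining (low-entropy) rows would go through the int6 path, matching the "selected embedding rows" scope in the quantization list.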
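The pruning/zstd size interaction is easiest to check empirically. A minimal harness under stated assumptions: `prune_smallest` is a hypothetical global magnitude-pruning helper, and stdlib zlib stands in for zstd level 22 so the sketch runs without third-party compressors; the direction of the size change has to be measured, not assumed.

```python
import zlib
import numpy as np

def compressed_size(w: np.ndarray, level: int = 9) -> int:
    # zlib stands in for zstd here; the PR uses zstd at level 22.
    return len(zlib.compress(w.tobytes(), level))

def prune_smallest(w: np.ndarray, sparsity: float = 0.03) -> np.ndarray:
    """Zero the smallest-magnitude fraction of weights (global threshold)."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k)[k]
    out = w.copy()
    out[np.abs(out) < thresh] = 0.0
    return out
```

Zeroing 3% of fp32 weights replaces scattered 4-byte values with zeros, which can perturb the compressor's match finding enough to grow the output, consistent with the negative finding above.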