PR #212 (closed)

Non-record: Negative findings on codebook quantization, magnitude pruning, multi-token prediction, embedding factorization

by mrdavtan
val_bpb: 1.1329
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB

Training Techniques

  • Quantization: int6 (bits: 6, scope: all weights)
  • Compression: zstd (level: 22)
  • Architecture: MLP3x. Expanded MLP hidden size from 1024 to 1536 (3x MLP expansion). parameters: {"hidden":1536}
  • Architecture: weight tying. Tied input embeddings and output embeddings.
  • Architecture: KV head count. Used fewer KV heads than attention heads. parameters: {"num_heads":8,"num_kv_heads":4}
  • Architecture: depth recurrence. Shared-block recurrent depth setup tested as an experimental technique. parameters: {"shared_blocks":3,"loops":3}
  • Architecture: SmearGate. Added SmearGate as an experimental architectural modification.
  • Architecture: BigramHash. Added BigramHash as an experimental architectural modification.
  • Optimizer: Muon (momentum: 0.99, weight_decay: null, other_params: {"muon_backend_steps":5})
  • LR Schedule: warmdown. parameters: {"warmdown_iters":20000}
  • Regularization: gradient clipping. parameters: {"grad_clip_norm":1}
  • Evaluation: stride-based eval. parameters: {"stride":64}
  • Sequence Length: train_length: 2048, eval_length: 2048
  • Weight Averaging: SWA. parameters: {"sample_every":50}
  • Test-Time Training: TTT. parameters: {"max_steps":500,"freeze_blocks":1}
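The SWA entry above keeps a running average of sampled checkpoint weights, and the Novel Contributions below note a bug where accumulating that average in bf16 goes wrong. A minimal sketch of why the accumulator must be fp32, with bf16 emulated by rounding fp32 bits (the helper names are illustrative, not the PR's code):

```python
import struct

def to_bf16(x):
    """Round a Python float to bfloat16 precision (keep 8 mantissa bits)."""
    b = struct.unpack("<I", struct.pack("<f", x))[0]
    b = (b + 0x7FFF + ((b >> 16) & 1)) & 0xFFFF0000  # round-to-nearest-even
    return struct.unpack("<I", struct.pack("<I", b))[0] and struct.unpack("<f", struct.pack("<I", b))[0]

def swa_update(avg, w, n, cast=lambda x: x):
    """Incremental running average: avg_n = avg_{n-1} + (w_n - avg_{n-1}) / n."""
    return cast(avg + (w - avg) / n)

# Average 1000 identical checkpoints whose weight is 1.001.
avg32 = avg16 = 0.0
for n in range(1, 1001):
    avg32 = swa_update(avg32, 1.001, n)            # fp32/fp64 accumulator
    avg16 = swa_update(avg16, 1.001, n, to_bf16)   # bf16 accumulator

# avg32 recovers 1.001; the bf16 accumulator collapses to 1.0 because the
# per-step increment is smaller than bf16's spacing near 1.0 (~0.0078).
```

The per-step correction `(w - avg) / n` shrinks as `n` grows, so a low-precision accumulator silently stops updating long before the average has converged.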

Novel Contributions

  • Int6 per-row quantization with 3x MLP expansion to fit a larger model within the artifact budget.
  • Controlled ablations showing multi-token prediction did not help on this setup.
  • Negative findings on codebook quantization: K-means codebooks compressed worse than int6 despite lower reconstruction error.
  • Negative findings on magnitude pruning: small amounts of pruning increased compressed artifact size.
  • Negative findings on embedding SVD/factorization: rank-64 linear factorization was not viable for the token embedding matrix.
  • Documentation of failed depth recurrence / Huginn-style eval scaling at small scale.
  • Documentation of QAT under torch.compile issues and implementation bugs such as SWA bf16 accumulation and zstd/zlib mismatch.
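The headline technique (int6 per-row quantization of all weights, then entropy-coding the packed bytes) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the PR's code: zlib stands in for zstd because it is in the Python stdlib (the PR itself notes a zstd/zlib mismatch bug), and the one-fp32-scale-per-row layout is assumed.

```python
import struct
import zlib

def quantize_row_int6(row):
    """Symmetric per-row quantization to signed 6-bit ints in [-31, 31]."""
    scale = (max(abs(x) for x in row) / 31) or 1.0  # guard all-zero rows
    return scale, [round(x / scale) for x in row]

def pack_int6(values):
    """Pack signed 6-bit values into bytes (4 values -> 3 bytes)."""
    bits = nbits = 0
    out = bytearray()
    for v in values:
        bits = (bits << 6) | (v & 0x3F)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:  # pad the final partial byte with zeros
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

# Toy 2x4 matrix standing in for a real weight tensor.
W = [[0.12, -0.50, 0.33, 0.01],
     [1.50, -1.50, 0.75, 0.00]]

payload = bytearray()
for row in W:
    scale, q = quantize_row_int6(row)
    payload += struct.pack("<f", scale)  # assumed: one fp32 scale per row
    payload += pack_int6(q)

artifact = zlib.compress(bytes(payload), 9)  # the PR compresses with zstd level 22
```

Dequantization multiplies each unpacked 6-bit value by its row scale, so the worst-case per-element error is scale/2; the negative codebook finding above suggests that a lower reconstruction error does not guarantee a smaller compressed artifact.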
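The evaluation settings above (stride-based eval with stride 64, eval_length 2048) imply a sliding-window scheme. One common indexing scheme, sketched here as an assumption about what the run does: score only the last `stride` tokens of each 2048-token window, so every token is scored exactly once with up to 2048 tokens of context.

```python
def eval_windows(n_tokens, context=2048, stride=64):
    """Return (window_start, window_end, score_start) triples for
    stride-based eval: each window covers at most `context` tokens,
    and only tokens in [score_start, window_end) are scored, so the
    scored spans partition [0, n_tokens) exactly once."""
    windows = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        windows.append((start, end, pos))
        pos = end
    return windows
```

Smaller strides give each scored token more context (hence a better bpb estimate) at the cost of proportionally more forward passes: stride 64 with a 2048 window costs roughly 32x a non-overlapping evaluation.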