PR #212 (closed)

Non-record: Negative findings on codebook quantization, magnitude pruning, multi-token prediction, embedding factorization

by mrdavtan
val_bpb: 1.1329
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB

Training Techniques

  • Quantization: int6 (bits: 6, scope: all weights)
  • Compression: zstd (level: 22)
  • Architecture: MLP3x. Expanded MLP hidden size from 1024 to 1536 (3x MLP expansion). parameters: {"hidden":1536}
  • Architecture: weight tying. Tied input embeddings and output embeddings.
  • Architecture: KV head count. Used fewer KV heads than attention heads. parameters: {"num_heads":8,"num_kv_heads":4}
  • Architecture: depth recurrence. Shared-block recurrent depth setup tested as an experimental technique. parameters: {"shared_blocks":3,"loops":3}
  • Architecture: SmearGate. Added SmearGate as an experimental architectural modification.
  • Architecture: BigramHash. Added BigramHash as an experimental architectural modification.
  • Optimizer: Muon (momentum: 0.99, weight_decay: null, other_params: {"muon_backend_steps":5})
  • LR Schedule: warmdown. parameters: {"warmdown_iters":20000}
  • Regularization: gradient clipping. parameters: {"grad_clip_norm":1}
  • Evaluation: stride-based eval. parameters: {"stride":64}
  • Sequence Length: train_length: 2048, eval_length: 2048
  • Weight Averaging: SWA. parameters: {"sample_every":50}
  • Test-Time Training: TTT. parameters: {"max_steps":500,"freeze_blocks":1}
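The SWA entry above keeps a running average of sampled checkpoint weights, and the Novel Contributions below note a bug where accumulating that average in bf16 goes wrong. A minimal sketch of why the accumulator must be fp32, with bf16 emulated by rounding fp32 bits (the helper names are illustrative, not the PR's code):

```python
import struct

def to_bf16(x):
    """Round a Python float to bfloat16 precision (keep 8 mantissa bits)."""
    b = struct.unpack("<I", struct.pack("<f", x))[0]
    b = (b + 0x7FFF + ((b >> 16) & 1)) & 0xFFFF0000  # round-to-nearest-even
    return struct.unpack("<I", struct.pack("<I", b))[0] and struct.unpack("<f", struct.pack("<I", b))[0]

def swa_update(avg, w, n, cast=lambda x: x):
    """Incremental running average: avg_n = avg_{n-1} + (w_n - avg_{n-1}) / n."""
    return cast(avg + (w - avg) / n)

# Average 1000 identical checkpoints whose weight is 1.001.
avg32 = avg16 = 0.0
for n in range(1, 1001):
    avg32 = swa_update(avg32, 1.001, n)            # fp32/fp64 accumulator
    avg16 = swa_update(avg16, 1.001, n, to_bf16)   # bf16 accumulator

# avg32 recovers 1.001; the bf16 accumulator collapses to 1.0 because the
# per-step increment is smaller than bf16's spacing near 1.0 (~0.0078).
```

The per-step correction `(w - avg) / n` shrinks as `n` grows, so a low-precision accumulator silently stops updating long before the average has converged.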

Novel Contributions

  • Int6 per-row quantization with 3x MLP expansion to fit a larger model within the artifact budget.
  • Controlled ablations showing multi-token prediction did not help on this setup.
  • Negative findings on codebook quantization: K-means codebooks compressed worse than int6 despite lower reconstruction error.
  • Negative findings on magnitude pruning: small amounts of pruning increased compressed artifact size.
  • Negative findings on embedding SVD/factorization: rank-64 linear factorization was not viable for the token embedding matrix.
  • Documentation of failed depth recurrence / Huginn-style eval scaling at small scale.
  • Documentation of QAT under torch.compile issues and implementation bugs such as SWA bf16 accumulation and zstd/zlib mismatch.
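The headline technique (int6 per-row quantization of all weights, then entropy-coding the packed bytes) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the PR's code: zlib stands in for zstd because it is in the Python stdlib (the PR itself notes a zstd/zlib mismatch bug), and the one-fp32-scale-per-row layout is assumed.

```python
import struct
import zlib

def quantize_row_int6(row):
    """Symmetric per-row quantization to signed 6-bit ints in [-31, 31]."""
    scale = (max(abs(x) for x in row) / 31) or 1.0  # guard all-zero rows
    return scale, [round(x / scale) for x in row]

def pack_int6(values):
    """Pack signed 6-bit values into bytes (4 values -> 3 bytes)."""
    bits = nbits = 0
    out = bytearray()
    for v in values:
        bits = (bits << 6) | (v & 0x3F)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:  # pad the final partial byte with zeros
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

# Toy 2x4 matrix standing in for a real weight tensor.
W = [[0.12, -0.50, 0.33, 0.01],
     [1.50, -1.50, 0.75, 0.00]]

payload = bytearray()
for row in W:
    scale, q = quantize_row_int6(row)
    payload += struct.pack("<f", scale)  # assumed: one fp32 scale per row
    payload += pack_int6(q)

artifact = zlib.compress(bytes(payload), 9)  # the PR compresses with zstd level 22
```

Dequantization multiplies each unpacked 6-bit value by its row scale, so the worst-case per-element error is scale/2; the negative codebook finding above suggests that a lower reconstruction error does not guarantee a smaller compressed artifact.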
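The evaluation settings above (stride-based eval with stride 64, eval_length 2048) imply a sliding-window scheme. One common indexing scheme, sketched here as an assumption about what the run does: score only the last `stride` tokens of each 2048-token window, so every token is scored exactly once with up to 2048 tokens of context.

```python
def eval_windows(n_tokens, context=2048, stride=64):
    """Return (window_start, window_end, score_start) triples for
    stride-based eval: each window covers at most `context` tokens,
    and only tokens in [score_start, window_end) are scored, so the
    scored spans partition [0, n_tokens) exactly once."""
    windows = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        windows.append((start, end, pos))
        pos = end
    return windows
```

Smaller strides give each scored token more context (hence a better bpb estimate) at the cost of proportionally more forward passes: stride 64 with a 2048 window costs roughly 32x a non-overlapping evaluation.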