PR #2046 (open)

Non-record: Negative Results Compendium — 14 failed experiments on PR-1493→PR-1787

by nprime06
val_bpb: 1.0634
Architecture: Transformer

Training Techniques

Quantization
  • GPTQ (scope: weights)
  • QAT (scope: FP8 MLP)
  • GPTQ (scope: per-tensor)
Regularization
  • weight decay (type: L2)
  • magnitude pruning (type: L1 sparsity)
Architecture
  • BigramHash: bigram training / bigram embedding approach
  • depth recurrence: loop curriculum over recurrent depth schedule 1→2→3
Other
  • Multi-token prediction auxiliary objective
  • Batch-size ramp training schedule
  • Dataset substitution to 100% FineWeb-Edu
  • max-autotune / Inductor kernel autotuning
  • DeepSeek NS10 training variant
  • Weight entropy shaping via bucket penalties
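The depth-recurrence "loop curriculum" over the schedule 1→2→3 can be sketched as a step-indexed function; the function name and the even phase thresholds below are illustrative assumptions, not the PR's actual implementation:

```python
def recurrence_depth(step: int, total_steps: int, schedule=(1, 2, 3)) -> int:
    """Hypothetical loop-curriculum schedule: map a training step to a
    recurrent depth, stepping through `schedule` in equal-length phases.
    The forward pass would then apply the shared block stack `depth` times.
    """
    phase = min(step * len(schedule) // total_steps, len(schedule) - 1)
    return schedule[phase]
```

For a 300-step run this yields depth 1 for the first third of training, depth 2 for the middle third, and depth 3 for the final third.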

Novel Contributions

  • Negative-results compendium covering 14 failed or marginal experiment directions across the PR-1493 to PR-1787 path
  • Standalone weight entropy analysis showing a fundamental tension between Gaussian expressiveness and compressibility
  • Demonstration that L2 weight decay is scale-invariant under SDClip quantization, undermining WD-based compression ideas
  • Evidence that GPTQ absorbs most quantization-grid and per-tensor allocation improvements
  • Analysis that the 600s compute budget is too tight for overhead-heavy methods like FP8, MTP, and recompilation
  • Bundled analysis scripts for entropy and bucket-distribution inspection
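The kind of bucket-based entropy inspection described above can be sketched as follows; this is a minimal stand-in for the bundled scripts (whose actual names and interfaces are not shown here), illustrating why near-Gaussian weights resist compression:

```python
import numpy as np

def bucket_entropy(weights, num_buckets=256):
    """Shannon entropy (bits) of a weight tensor over equal-width buckets.

    A rough proxy for compressibility: a near-Gaussian weight distribution
    spreads mass over many buckets (high entropy, hard to compress), which
    is one face of the expressiveness/compressibility tension noted above.
    """
    hist, _ = np.histogram(np.asarray(weights).ravel(), bins=num_buckets)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty buckets; 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())
```

A constant tensor scores 0 bits (one occupied bucket), while weights spread uniformly across the buckets approach the maximum of log2(num_buckets) bits.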