PR #2046
Non-record: Negative Results Compendium — 14 failed experiments on PR-1493→PR-1787
Status: open · by nprime06
val_bpb: 1.0634
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Quantization
- GPTQ (bits: null, scope: weights)
- QAT (bits: null, scope: FP8 MLP)
- GPTQ (bits: null, scope: per-tensor)
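Both GPTQ entries and the QAT entry share a round-to-grid core; since the card records bits as null, the grid width below is an assumed 4 bits. This is a minimal per-tensor sketch only: GPTQ proper adds Hessian-based error compensation on top of this step, and QAT would run the fake-quantization inside the training loop rather than after it.

```python
import torch

def fake_quantize_per_tensor(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor round-to-grid quantization (illustrative only).

    The 4-bit default is an assumption; the card records bits as null.
    GPTQ additionally compensates rounding error column by column using
    Hessian information, which this sketch omits.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 levels each side for 4-bit
    scale = w.abs().max().clamp_min(1e-12) / qmax   # one scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                                # dequantized ("fake-quant") weights

# Example: fake-quantize one linear layer's weights.
layer = torch.nn.Linear(256, 256)
with torch.no_grad():
    layer.weight.copy_(fake_quantize_per_tensor(layer.weight, bits=4))
```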
Regularization
- weight decay (parameters: {"type": "L2"})
- magnitude pruning (parameters: {"type": "L1 sparsity"})
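A minimal sketch of the magnitude-pruning entry, assuming a global unstructured threshold; the sparsity level is not recorded on the card, so the 50% default below is a placeholder.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of a weight tensor.

    The 50% level is a placeholder; the card does not record the
    sparsity that was actually tried.
    """
    k = int(w.numel() * sparsity)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)   # hard unstructured mask

# Example: prune half the weights of a linear layer.
layer = torch.nn.Linear(128, 128)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.5))
```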
Architecture
- BigramHash: bigram training / bigram embedding approach (parameters: null)
- depth recurrence: loop curriculum over recurrent depth schedule 1→2→3 (parameters: {"schedule": "1->2->3"})
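One plausible reading of the depth-recurrence entry: a weight-tied block applied `depth` times per forward pass, with the 1→2→3 loop curriculum stepped over equal thirds of training. The equal-thirds split is an assumption; only the schedule string appears on the card.

```python
import torch.nn as nn

class RecurrentDepth(nn.Module):
    """Apply one weight-tied block `depth` times per forward pass."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.depth = 1  # raised over training by the curriculum below

    def forward(self, x):
        for _ in range(self.depth):
            x = self.block(x)
        return x

def depth_for_step(step: int, total_steps: int,
                   schedule=(1, 2, 3)) -> int:
    """Loop curriculum for the 1 -> 2 -> 3 schedule on the card.

    Equal-thirds phase boundaries are an assumption.
    """
    phase = min(step * len(schedule) // total_steps, len(schedule) - 1)
    return schedule[phase]

# Example, inside a training loop:
# model.body.depth = depth_for_step(step, total_steps)
```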
Other
- Multi-token prediction auxiliary objective (parameters: null)
- Batch-size ramp training schedule (parameters: null)
- Dataset substitution to 100% FineWeb-Edu (parameters: null)
- max-autotune / Inductor kernel autotuning (parameters: null)
- DeepSeek NS10 training variant (parameters: null)
- Weight entropy shaping via bucket penalties (parameters: null; sketch after this list)
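The bucket-penalty mechanism behind the weight-entropy-shaping entry is not spelled out on the card. The following is a hypothetical soft-histogram reconstruction: each weight is softly assigned to evenly spaced buckets, and the Shannon entropy of the resulting bucket mass serves as a differentiable penalty (the function name, bucket count, and temperature are all illustrative).

```python
import torch

def soft_bucket_entropy(w: torch.Tensor, n_buckets: int = 16,
                        temperature: float = 0.1) -> torch.Tensor:
    """Soft-histogram entropy of a weight tensor, in nats.

    Hypothetical reconstruction: the PR's exact penalty is not on the
    card. Weights are softly assigned to evenly spaced buckets so the
    result stays differentiable; a term `lambda * entropy` (or its
    negation) added to the loss then shapes the weight distribution.
    """
    lo, hi = w.min().detach(), w.max().detach()
    centers = torch.linspace(lo.item(), hi.item(), n_buckets, device=w.device)
    dist = (w.flatten().unsqueeze(1) - centers) ** 2    # [n_weights, n_buckets]
    assign = torch.softmax(-dist / temperature, dim=1)  # soft bucket membership
    p = assign.mean(dim=0).clamp_min(1e-12)             # bucket probabilities
    return -(p * p.log()).sum()

# Example: loss = task_loss + 1e-4 * soft_bucket_entropy(layer.weight)
```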
Novel Contributions
- Negative-results compendium covering 14 failed or marginal experiment directions across the PR-1493 to PR-1787 path
- Standalone weight entropy analysis showing a fundamental tension between Gaussian expressiveness and compressibility
- Demonstration that L2 weight decay is scale-invariant under SDClip quantization, undermining weight-decay-based compression ideas
- Evidence that GPTQ absorbs most quantization-grid and per-tensor allocation improvements
- Analysis that the 600-second compute budget is too tight for overhead-heavy methods such as FP8, multi-token prediction, and recompilation
- Bundled analysis scripts for entropy and bucket-distribution inspection (a minimal stand-in appears below)
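The bundled scripts themselves are not reproduced on the card; as a minimal stand-in, a per-tensor entropy report over a model's weight histograms might look like this (the function name and bucket count are made up):

```python
import math
import torch

def weight_entropy_report(model: torch.nn.Module, n_buckets: int = 256) -> None:
    """Print the Shannon entropy (bits) of each parameter's bucket histogram.

    The maximum is log2(n_buckets) bits; lower entropy means a more
    compressible tensor, the tension with Gaussian expressiveness that
    the compendium's entropy analysis explores.
    """
    max_bits = math.log2(n_buckets)
    for name, p in model.named_parameters():
        hist = torch.histc(p.detach().float().flatten(), bins=n_buckets)
        prob = (hist / hist.sum()).clamp_min(1e-12)
        bits = -(prob * prob.log2()).sum().item()
        print(f"{name}: {bits:.2f} / {max_bits:.2f} bits")

# Example:
# weight_entropy_report(torch.nn.Linear(64, 64))
```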