PR #2046 (open)

Non-record: Negative Results Compendium — 14 failed experiments on PR-1493→PR-1787

by nprime06
val_bpb: 1.0634
Architecture: Transformer

Training Techniques

Quantization
  • GPTQ (scope: weights)
  • QAT (scope: FP8 MLP)
  • GPTQ (scope: per-tensor)
Regularization
  • weight decay (type: L2)
  • magnitude pruning (type: L1 sparsity)
Architecture
  • BigramHash: bigram training / bigram embedding approach
  • depth recurrence: loop curriculum over recurrent depth schedule 1→2→3
Other
  • Multi-token prediction auxiliary objective
  • Batch-size ramp training schedule
  • Dataset substitution to 100% FineWeb-Edu
  • max-autotune / Inductor kernel autotuning
  • DeepSeek NS10 training variant
  • Weight entropy shaping via bucket penalties
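The depth-recurrence "loop curriculum" over the schedule 1→2→3 can be sketched as a step-indexed function; the function name and the even phase thresholds below are illustrative assumptions, not the PR's actual implementation:

```python
def recurrence_depth(step: int, total_steps: int, schedule=(1, 2, 3)) -> int:
    """Hypothetical loop-curriculum schedule: map a training step to a
    recurrent depth, stepping through `schedule` in equal-length phases.
    The forward pass would then apply the shared block stack `depth` times.
    """
    phase = min(step * len(schedule) // total_steps, len(schedule) - 1)
    return schedule[phase]
```

For a 300-step run this yields depth 1 for the first third of training, depth 2 for the middle third, and depth 3 for the final third.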

Novel Contributions

  • Negative-results compendium covering 14 failed or marginal experiment directions across the PR-1493 to PR-1787 path
  • Standalone weight entropy analysis showing a fundamental tension between Gaussian expressiveness and compressibility
  • Demonstration that L2 weight decay is scale-invariant under SDClip quantization, undermining WD-based compression ideas
  • Evidence that GPTQ absorbs most quantization-grid and per-tensor allocation improvements
  • Analysis that the 600s compute budget is too tight for overhead-heavy methods like FP8, MTP, and recompilation
  • Bundled analysis scripts for entropy and bucket-distribution inspection
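The kind of bucket-based entropy inspection described above can be sketched as follows; this is a minimal stand-in for the bundled scripts (whose actual names and interfaces are not shown here), illustrating why near-Gaussian weights resist compression:

```python
import numpy as np

def bucket_entropy(weights, num_buckets=256):
    """Shannon entropy (bits) of a weight tensor over equal-width buckets.

    A rough proxy for compressibility: a near-Gaussian weight distribution
    spreads mass over many buckets (high entropy, hard to compress), which
    is one face of the expressiveness/compressibility tension noted above.
    """
    hist, _ = np.histogram(np.asarray(weights).ravel(), bins=num_buckets)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty buckets; 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())
```

A constant tensor scores 0 bits (one occupied bucket), while weights spread uniformly across the buckets approach the maximum of log2(num_buckets) bits.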