val_bpb: 0.2532
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~11.4 MB
Training Techniques
Architecture
- ReLU²: squared ReLU activation used in the model.
- LeakyReLU: Leaky ReLU squared activation variant used in the model.
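A minimal sketch of the two activations above. The negative-branch handling of the squared Leaky ReLU (sign-preserving square, slope 0.01) is an assumption, since the card does not record it:

```python
def relu_squared(x: float) -> float:
    # ReLU²: clamp negatives to zero, then square.
    return max(x, 0.0) ** 2

def leaky_relu_squared(x: float, slope: float = 0.01) -> float:
    # Hypothetical squared-leaky variant: keep a small signed slope for
    # negative inputs, then square while preserving the sign so the
    # negative region still carries gradient signal.
    y = x if x > 0 else slope * x
    return y * abs(y)  # sign-preserving square
```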
- HybridNorm: mixed pre-norm and post-norm scheme, with post-norm in deeper layers.
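One way to express the pre/post split described above. The 0.5 depth threshold is a placeholder; the card does not say where the switch happens:

```python
def norm_placement(layer_idx: int, n_layers: int,
                   post_norm_start: float = 0.5) -> str:
    """Return 'pre' or 'post' normalization for a given layer.

    Hypothetical rule: shallow layers keep pre-norm (training
    stability), layers past a depth fraction switch to post-norm.
    The 0.5 fraction is an assumed hyperparameter, not a reported one.
    """
    return "post" if layer_idx >= int(n_layers * post_norm_start) else "pre"
```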
- SmearGate: combined with BigramHash for token mixing.
- BigramHash: bigram hash / embedding component used in the architecture.
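A speculative sketch of the two mixing components above, under the common readings of these names: a "smear" gate blends in a gated fraction of the previous token's embedding, and a bigram hash maps (prev, cur) token pairs into a fixed-size auxiliary embedding table. Table size and gating form are assumptions:

```python
import hashlib

N_BUCKETS = 1 << 16  # hypothetical bigram-table size, not from the card

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    # Hash the (prev, cur) token-id pair into a bounded embedding table.
    h = hashlib.blake2b(f"{prev_id},{cur_id}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "little") % N_BUCKETS

def smear(prev_emb, cur_emb, gate):
    # SmearGate-style mixing: add a per-channel gated fraction of the
    # previous token's embedding to the current one.
    return [c + g * p for c, g, p in zip(cur_emb, gate, prev_emb)]
```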
- Differential Attention: attention computed as the difference of two softmax attention maps, which suppresses common-mode attention noise.
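The core of differential attention can be sketched as subtracting a second, λ-scaled softmax map from the first (in the style of the DIFF Transformer). The λ value here is a placeholder, not the model's learned value:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def diff_attention_weights(scores1, scores2, lam=0.5):
    """Differential attention weights: softmax(s1) - lam * softmax(s2).
    The two score maps come from two independent q/k projections;
    lam is a learned scalar in practice (0.5 is illustrative)."""
    a1 = softmax(scores1)
    a2 = softmax(scores2)
    return [x - lam * y for x, y in zip(a1, a2)]
```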
- WaveletGPT: wavelet-based GPT architectural variant.
- VGA: VGA architectural component included in the model.
- Multi-Token Prediction: auxiliary multi-token prediction heads used during training.
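For the multi-token prediction heads, each position predicts the next k tokens rather than only the next one; the auxiliary losses are summed into the training objective. The value k=3 below is illustrative, not from the card:

```python
def multi_token_targets(tokens, k=3):
    """For each position t, the auxiliary heads' targets are tokens
    t+1 .. t+k. Positions near the end of the sequence simply get
    fewer targets. k is a hypothetical horizon."""
    out = []
    for t in range(len(tokens)):
        out.append(tokens[t + 1 : t + 1 + k])
    return out
```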
Optimizer
- Muon (weight decay, momentum, and other hyperparameters not recorded)
Quantization
- QAT (bits: 6, scope: all)
- GPTQ (bits: not recorded, scope: all)
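The int6 QAT step can be sketched as symmetric per-tensor fake quantization: the forward pass sees quantize-dequantized weights while gradients flow straight through (STE). The per-tensor symmetric scheme is an assumption; only the bit width is recorded:

```python
def fake_quant(w, bits=6):
    """Quantize-dequantize a weight list to signed `bits`-wide integers
    with a symmetric per-tensor scale. During QAT the forward pass uses
    these values; the backward pass treats the op as identity (STE)."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6
    scale = max(abs(x) for x in w) / qmax or 1.0  # avoid zero scale
    return [round(x / scale) * scale for x in w]
```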
Regularization
- Magnitude pruning (sparsity: 2%)
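Magnitude pruning at 2% sparsity amounts to zeroing the smallest-magnitude 2% of weights. A minimal sketch (ties at the threshold may prune slightly more than the target fraction):

```python
def magnitude_prune(w, sparsity=0.02):
    """Zero out the smallest-magnitude `sparsity` fraction of weights
    (2% per the card). Global, unstructured pruning is assumed."""
    k = int(len(w) * sparsity)
    if k == 0:
        return list(w)
    threshold = sorted(abs(x) for x in w)[k - 1]
    return [0.0 if abs(x) <= threshold else x for x in w]
```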
Other
- OptRot: Hadamard rotation applied before quantization to improve error distribution.
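The idea behind the pre-quantization rotation: multiplying weights by an orthogonal Hadamard matrix spreads outliers evenly across coordinates, so a uniform quantizer wastes less range. A sketch using the Sylvester construction (dimension must be a power of two); the card does not detail OptRot's exact procedure:

```python
def hadamard(n):
    """Sylvester Hadamard matrix; n must be a power of two."""
    H = [[1]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-x for x in r] for r in H]
    return H

def rotate(vec):
    """Multiply by H / sqrt(n): an orthogonal (norm-preserving)
    transform that smears outliers across all coordinates."""
    n = len(vec)
    H = hadamard(n)
    s = n ** 0.5
    return [sum(h * v for h, v in zip(row, vec)) / s for row in H]
```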
Compression
- zlib (level: not recorded)
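The final artifact pass is a plain zlib compression of the serialized weights. Level 9 below is an assumption, since the card does not record the level:

```python
import zlib

def pack_artifact(weight_bytes: bytes, level: int = 9) -> bytes:
    # Compress the serialized model to shrink the on-disk artifact.
    # level=9 (max compression) is assumed, not reported.
    return zlib.compress(weight_bytes, level)
```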
Test-Time Training
- LoRA TTT (rank: 8)
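Per-document LoRA test-time training adapts only two small low-rank factors per weight matrix while the base weights stay frozen. Rank 8 is from the card; the alpha scaling is an assumed convention:

```python
def lora_delta(A, B, alpha=16.0, rank=8):
    """Low-rank weight update: delta_W = (alpha / rank) * A @ B, with
    A of shape (d_out, r) and B of shape (r, d_in). At test time only
    A and B are fitted on the current document. alpha=16 is a guess."""
    scale = alpha / rank
    d_out, d_in = len(A), len(B[0])
    return [[scale * sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(d_in)] for i in range(d_out)]
```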
Evaluation
- n-gram cache (orders: 2-10, buckets: 4,000,000, entropy-adaptive alpha)
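A sketch of the n-gram cache, reduced to a single order for brevity (the card uses orders 2-10). Contexts hash into 4M count buckets, and the cache is interpolated with the base model probability using a weight that shrinks as the cache distribution's entropy grows; the exact adaptive rule is not documented, so the one below is an assumption:

```python
import math
from collections import defaultdict

BUCKETS = 4_000_000  # bucket count from the card

class NGramCache:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, context, token):
        # Hash the context tuple into a fixed number of buckets.
        self.counts[hash(context) % BUCKETS][token] += 1

    def prob(self, context, token, base_p, max_alpha=0.3):
        """Blend base model prob with the cached n-gram distribution.
        alpha shrinks with cache entropy (assumed rule): a peaked,
        confident cache gets more weight than a diffuse one."""
        dist = self.counts[hash(context) % BUCKETS]
        total = sum(dist.values())
        if total == 0:
            return base_p
        ps = [c / total for c in dist.values()]
        entropy = -sum(p * math.log(p) for p in ps)
        alpha = max_alpha / (1.0 + entropy)
        return (1 - alpha) * base_p + alpha * dist[token] / total
```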
- kNN-LM (projection: 1024 -> 64, storage: fp16)
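For the kNN-LM datastore, hidden states are shrunk with a fixed random projection (1024 -> 64, per the card) and stored as fp16 keys. A sketch using the stdlib `struct` half-precision format; the Gaussian projection is a standard Johnson-Lindenstrauss choice, assumed rather than documented:

```python
import random
import struct

def make_projection(d_in=1024, d_out=64, seed=0):
    # Fixed random Gaussian projection matrix (d_out x d_in),
    # scaled to roughly preserve distances.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) / d_out ** 0.5 for _ in range(d_in)]
            for _ in range(d_out)]

def project_fp16(vec, P):
    # Project a hidden state to the low dimension and serialize it as
    # IEEE half precision ('e' format) to halve datastore key size.
    low = [sum(p * v for p, v in zip(row, vec)) for row in P]
    return struct.pack(f"{len(low)}e", *low)
```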
- TurboQuant KV cache compression (bits: 3)
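Only the 3-bit width of the KV cache compression is recorded; TurboQuant's actual codebook is not described here. As a stand-in, a per-vector uniform 3-bit quantizer (8 levels spanning each entry's min-max range):

```python
def quant3(v):
    """Per-vector 3-bit uniform quantization of one KV cache entry:
    map each value to one of 8 levels over [min, max]. Illustrative
    stand-in, not TurboQuant's exact scheme."""
    lo, hi = min(v), max(v)
    scale = (hi - lo) / 7 or 1.0            # 7 intervals -> 8 codes
    codes = [round((x - lo) / scale) for x in v]   # ints in 0..7
    dequant = [lo + c * scale for c in codes]
    return codes, dequant
```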
Weight Averaging
- EMA (parameters not recorded)
- SWA (parameters not recorded)
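The two averaging schemes in one sketch: EMA keeps an exponentially decayed shadow copy of the weights, while SWA keeps a running mean over sampled checkpoints. The decay value is a typical default, not the reported one:

```python
def ema_update(ema, w, decay=0.999):
    # Exponential moving average of weights; decay=0.999 is a common
    # default, assumed here since the card records no value.
    return [decay * e + (1 - decay) * x for e, x in zip(ema, w)]

def swa_update(mean, w, n):
    # Stochastic weight averaging: incremental mean after having
    # already averaged n checkpoints.
    return [m + (x - m) / (n + 1) for m, x in zip(mean, w)]
```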
Novel Contributions
- Combines many techniques into a single env-var-toggleable pipeline
- Uses int6 QAT from step 0 with GPTQ and pruning to fit under the artifact limit
- Applies per-document LoRA test-time training
- Adds entropy-adaptive n-gram backoff caching
- Adds kNN-LM with random projection and fp16 storage
- Uses TurboQuant KV cache compression
- Reports strong ablation results across three seeds