PR #1537 (open)
Non-Record: CAT, Sparsity (Structured and Hessian-Guided), MoE, KAN Negative Results
by pireylow
val_bpb
1.3971
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,076,488 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
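The record specifies GPTQ at 6 bits across all weights. GPTQ proper compensates rounding error column-by-column using second-order (Hessian) information; the minimal sketch below shows only the 6-bit symmetric uniform grid that GPTQ rounds onto, with illustrative weights.

```python
def quantize_rtn(weights, bits=6):
    """Round-to-nearest onto a symmetric uniform grid with 2**bits levels.

    This is the grid GPTQ targets; GPTQ's Hessian-based error
    compensation step is omitted in this sketch.
    """
    n_levels = 2 ** bits                          # 64 levels at 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / (n_levels / 2 - 1)          # grid step size
    quantized = []
    for w in weights:
        q = round(w / scale)                      # integer grid index
        q = max(-(n_levels // 2), min(n_levels // 2 - 1, q))  # clamp to range
        quantized.append(q * scale)               # dequantize back to float
    return quantized

print(quantize_rtn([0.31, -0.07, 0.52, -0.48]))
```

At 6 bits the worst-case per-weight error is half a grid step, which is why the artifact stays close to the dense model's quality.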
Architecture
depth recurrence
Recurrence loop enabled across layers with a wider loop range in some runs.
parameters: {"loop_start":3,"loop_end":5}
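Depth recurrence with loop_start 3 and loop_end 5 means the forward pass re-enters a span of layers, reusing their weights. A minimal sketch, assuming a half-open span repeated twice (the exact span semantics and loop count are not stated in the record), with toy layers that record application order:

```python
def run_with_recurrence(x, layers, loop_start=3, loop_end=5, loops=2):
    """Apply layers in order, repeating layers[loop_start:loop_end] `loops` times."""
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(loops):                 # re-enter the recurrent span
        for layer in layers[loop_start:loop_end]:
            x = layer(x)                   # same weights each pass
    for layer in layers[loop_end:]:
        x = layer(x)
    return x

# Toy "layers" that append their index, to show the execution order.
layers = [lambda x, i=i: x + [i] for i in range(8)]
print(run_with_recurrence([], layers))  # → [0, 1, 2, 3, 4, 3, 4, 5, 6, 7]
```

Recurrence adds effective depth without adding parameters, which matters under an artifact-size cap.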
Gated Attention
Parallel residuals / GPT-J style residual routing from layer 7+.
parameters: {"layer_start":7}
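GPT-J-style routing from layer 7 onward means attention and MLP read the same block input and their outputs are summed into the residual in parallel, instead of the usual sequential attn-then-MLP ordering. A toy scalar sketch (attn/mlp stand-ins are illustrative):

```python
def block(x, attn, mlp, layer_idx, layer_start=7):
    """Parallel residual routing for layers >= layer_start, sequential below."""
    if layer_idx >= layer_start:
        return x + attn(x) + mlp(x)      # both branches read the same input
    h = x + attn(x)                      # standard sequential ordering
    return h + mlp(h)

attn = lambda x: 0.1 * x
mlp = lambda x: 0.01 * x
print(block(1.0, attn, mlp, layer_idx=7))   # parallel branches
print(block(1.0, attn, mlp, layer_idx=0))   # sequential branches
```

The parallel form lets the two branches run concurrently and shares one normalization per block.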
LeakyReLU
MLP activation uses LeakyReLU squared.
parameters: {"slope":0.5}
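A minimal sketch of the stated activation, read as the plain composition square(LeakyReLU(x)) with negative slope 0.5; whether the PR preserves the sign of the negative branch when squaring is not stated in the record:

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, then squared."""
    y = x if x > 0 else slope * x   # LeakyReLU
    return y * y                    # square (note: negative branch becomes positive)

print([leaky_relu_squared(v) for v in (-2.0, 0.0, 3.0)])  # → [1.0, 0.0, 9.0]
```

This follows the squared-ReLU recipe used in fast-training baselines, but keeps a leaky negative branch so gradients flow for negative pre-activations.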
Regularization
magnitude pruning
parameters: {"pattern":"2:4 structured sparsity","scope":"MLP weight matrices"}
magnitude pruning
parameters: {"pattern":"2:4 structured sparsity","scope":"MLP weight matrices","importance":"Hessian-guided"}
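Both pruning entries use the 2:4 pattern: in every group of 4 consecutive weights, the 2 least important are zeroed. Plain magnitude pruning scores by w²; the Hessian-guided variant reweights by a per-weight curvature estimate, e.g. the diagonal of the GPTQ Hessian. A minimal sketch covering both (names are illustrative):

```python
def prune_2_of_4(weights, diag_h=None):
    """Zero the 2 least-important weights in each consecutive group of 4."""
    assert len(weights) % 4 == 0
    out = list(weights)
    for g in range(0, len(weights), 4):
        # Importance: w^2, optionally scaled by a Hessian-diagonal estimate.
        score = lambda i: weights[i] ** 2 * (diag_h[i] if diag_h else 1.0)
        for i in sorted(range(g, g + 4), key=score)[:2]:  # two smallest
            out[i] = 0.0
    return out

w = [0.9, -0.1, 0.3, -0.8, 0.2, 0.7, -0.6, 0.05]
print(prune_2_of_4(w))  # → [0.9, 0.0, 0.0, -0.8, 0.0, 0.7, -0.6, 0.0]
```

The fixed 2:4 pattern is what makes the zeros cheap to encode, so the pruned matrices shrink the stored artifact rather than just the FLOP count.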
Other
other
Compressor-aware training with a differentiable proxy loss encouraging weights near quantization grid points.
parameters: {"cat_every":50,"cat_weight":0.001}
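The CAT proxy penalizes distance from each weight to its nearest quantization grid point, added to the task loss every cat_every steps with coefficient cat_weight. The exact proxy form is not spelled out in the record; squared distance to the nearest grid point is one differentiable-almost-everywhere choice, sketched here:

```python
def cat_proxy_loss(weights, scale):
    """Mean squared distance from each weight to its nearest grid point."""
    total = 0.0
    for w in weights:
        nearest = round(w / scale) * scale   # snap to the quantization grid
        total += (w - nearest) ** 2
    return total / len(weights)

def total_loss(task_loss, weights, scale, step, cat_every=50, cat_weight=0.001):
    """Add the CAT penalty only every `cat_every` steps (50 in this PR)."""
    if step % cat_every == 0:
        return task_loss + cat_weight * cat_proxy_loss(weights, scale)
    return task_loss

print(cat_proxy_loss([0.10, 0.14], scale=0.1))
```

Weights that sit near grid points quantize with less error, which is the sense in which training becomes "compressor-aware."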
other
Mixture of Experts with 4 experts per MLP and top-2 routing plus load-balancing loss.
parameters: {"experts":4,"top_k":2,"alpha":0.01}
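With 4 experts and top-2 routing, a router scores the experts per token, the top 2 are evaluated and mixed by their softmax weights, and an auxiliary load-balancing loss (Switch-Transformer style, coefficient alpha) pushes routing toward uniform expert usage. A toy sketch with scalar experts (the experts and logits are illustrative):

```python
import math

def top2_moe(x, router_logits, experts):
    """Mix the top-2 experts by their softmax weights over the selected logits."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i])[-2:]
    exps = [math.exp(router_logits[i]) for i in top]
    z = sum(exps)
    return sum(e / z * experts[i](x) for i, e in zip(top, exps))

def load_balance_loss(assign_counts, mean_probs, alpha=0.01):
    """Switch-style aux loss: n_experts * sum(token_fraction_i * mean_prob_i)."""
    n = len(assign_counts)
    total = sum(assign_counts)
    return alpha * n * sum(c / total * p for c, p in zip(assign_counts, mean_probs))

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: 0.5 * x]
print(top2_moe(1.0, [2.0, 2.0, -1.0, 0.0], experts))
```

Under a 16MB artifact cap, MoE trades parameter count for conditional compute, which is likely why it lands in the negative-results column here.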
other
KAN layers using spline-parameterized activations instead of standard MLPs.
parameters: {"grid_size":5,"order":3}
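A KAN layer replaces fixed MLP activations with learned splines on each edge. A minimal sketch of one such edge activation, assuming an order-3 B-spline over a uniform grid of size 5 on [-1, 1] via the Cox-de Boor recursion; the coefficients below are arbitrary placeholders for what a real KAN layer would learn per (input, output) pair:

```python
def bspline_basis(x, knots, i, k):
    """Cox-de Boor recursion for the i-th B-spline basis function of order k."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = ((x - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(x, knots, i, k - 1))
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(x, knots, i + 1, k - 1))
    return left + right

def spline_activation(x, coeffs, grid_size=5, order=3, lo=-1.0, hi=1.0):
    """Learned scalar activation: weighted sum of grid_size + order B-splines."""
    step = (hi - lo) / grid_size
    # Uniform knot vector, extended by `order` knots on each side.
    knots = [lo + (j - order) * step for j in range(grid_size + 2 * order + 1)]
    return sum(c * bspline_basis(x, knots, i, order) for i, c in enumerate(coeffs))

coeffs = [0.1 * i for i in range(5 + 3)]   # grid_size + order coefficients
print(spline_activation(0.0, coeffs))
```

Each scalar connection carries grid_size + order learned coefficients instead of one weight, which inflates parameter count per unit of capacity and plausibly explains the negative result under the size constraint.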
Test-Time Training
score-first TTT
parameters: null
Evaluation
sliding window eval
parameters: null
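Sliding-window evaluation scores a long sequence in overlapping windows so every token is evaluated once with substantial left context. A sketch of the window bookkeeping, assuming the first window scores all its tokens and each later window scores only its fresh suffix (the window/stride values are illustrative; the record does not state them):

```python
def sliding_window_spans(n_tokens, window=2048, stride=512):
    """Return (ctx_start, end, score_from) triples.

    Tokens [score_from, end) are scored; [ctx_start, score_from) is context only.
    """
    spans = [(0, min(window, n_tokens), 0)]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), end, pos))
        pos = end
    return spans

print(sliding_window_spans(3000))
```

Each token is scored exactly once, so per-token losses aggregate directly into a val_bpb figure like the 1.3971 reported above.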
LR Schedule
warmdown
parameters: {"warmdown":0.72}
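A warmdown of 0.72 reads as: hold the LR constant, then decay linearly to zero over the final 72% of training, i.e. the trapezoidal schedule's down-ramp. Whether the runs also use a warmup ramp is not stated in the record. A minimal sketch:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown=0.72):
    """Constant LR, then linear decay to zero over the last `warmdown` fraction."""
    decay_start = total_steps * (1 - warmdown)   # decay begins here
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)

print([round(warmdown_lr(s, 100, 1.0), 3) for s in (0, 28, 64, 100)])
```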
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.9965}
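EMA weight averaging keeps a shadow copy of the weights, updated each step as shadow = decay * shadow + (1 - decay) * weight, and evaluates the shadow copy rather than the raw weights. A minimal sketch with the record's decay of 0.9965:

```python
def ema_update(shadow, weights, decay=0.9965):
    """One EMA step over a flat list of parameters."""
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0, 1.0]
weights = [1.0, 1.0]
for _ in range(3):          # three training steps with fixed weights
    shadow = ema_update(shadow, weights)
print(shadow)               # first entry approaches 1 - 0.9965**3
```

A decay of 0.9965 gives an averaging horizon of roughly 1/(1 - 0.9965) ≈ 286 steps.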
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Compressor-aware training (CAT) for compression-friendly weights
- 2:4 structured sparsity for artifact-size reduction
- Hessian-guided structured sparsity using GPTQ Hessians
- Mixture of Experts exploration under the 16MB constraint
- KAN exploration under the 16MB constraint
- Systematic negative-results comparison against a strong baseline