PR #1537 (open)
Non-Record: CAT, Sparsity (Structured and Hessian-Guided), MoE, KAN Negative Results
by pireylow
val_bpb
1.3971
Architecture
Transformer
Optimizer
Muon
Artifact Size
16,076,488 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
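The record specifies GPTQ at 6 bits across all weights. GPTQ proper compensates rounding error column-by-column using second-order (Hessian) information; the minimal sketch below shows only the 6-bit symmetric uniform grid that GPTQ rounds onto, with illustrative weights.

```python
def quantize_rtn(weights, bits=6):
    """Round-to-nearest onto a symmetric uniform grid with 2**bits levels.

    This is the grid GPTQ targets; GPTQ's Hessian-based error
    compensation step is omitted in this sketch.
    """
    n_levels = 2 ** bits                          # 64 levels at 6 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / (n_levels / 2 - 1)          # grid step size
    quantized = []
    for w in weights:
        q = round(w / scale)                      # integer grid index
        q = max(-(n_levels // 2), min(n_levels // 2 - 1, q))  # clamp to range
        quantized.append(q * scale)               # dequantize back to float
    return quantized

print(quantize_rtn([0.31, -0.07, 0.52, -0.48]))
```

At 6 bits the worst-case per-weight error is half a grid step, which is why the artifact stays close to the dense model's quality.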
Architecture
depth recurrence
Recurrence loop enabled across layers with a wider loop range in some runs.
parameters: {"loop_start":3,"loop_end":5}
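Depth recurrence with loop_start 3 and loop_end 5 means the forward pass re-enters a span of layers, reusing their weights. A minimal sketch, assuming a half-open span repeated twice (the exact span semantics and loop count are not stated in the record), with toy layers that record application order:

```python
def run_with_recurrence(x, layers, loop_start=3, loop_end=5, loops=2):
    """Apply layers in order, repeating layers[loop_start:loop_end] `loops` times."""
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(loops):                 # re-enter the recurrent span
        for layer in layers[loop_start:loop_end]:
            x = layer(x)                   # same weights each pass
    for layer in layers[loop_end:]:
        x = layer(x)
    return x

# Toy "layers" that append their index, to show the execution order.
layers = [lambda x, i=i: x + [i] for i in range(8)]
print(run_with_recurrence([], layers))  # → [0, 1, 2, 3, 4, 3, 4, 5, 6, 7]
```

Recurrence adds effective depth without adding parameters, which matters under an artifact-size cap.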
Gated Attention
Parallel residuals / GPT-J style residual routing from layer 7+.
parameters: {"layer_start":7}
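GPT-J-style routing from layer 7 onward means attention and MLP read the same block input and their outputs are summed into the residual in parallel, instead of the usual sequential attn-then-MLP ordering. A toy scalar sketch (attn/mlp stand-ins are illustrative):

```python
def block(x, attn, mlp, layer_idx, layer_start=7):
    """Parallel residual routing for layers >= layer_start, sequential below."""
    if layer_idx >= layer_start:
        return x + attn(x) + mlp(x)      # both branches read the same input
    h = x + attn(x)                      # standard sequential ordering
    return h + mlp(h)

attn = lambda x: 0.1 * x
mlp = lambda x: 0.01 * x
print(block(1.0, attn, mlp, layer_idx=7))   # parallel branches
print(block(1.0, attn, mlp, layer_idx=0))   # sequential branches
```

The parallel form lets the two branches run concurrently and shares one normalization per block.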
LeakyReLU
MLP activation uses LeakyReLU squared.
parameters: {"slope":0.5}
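A minimal sketch of the stated activation, read as the plain composition square(LeakyReLU(x)) with negative slope 0.5; whether the PR preserves the sign of the negative branch when squaring is not stated in the record:

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, then squared."""
    y = x if x > 0 else slope * x   # LeakyReLU
    return y * y                    # square (note: negative branch becomes positive)

print([leaky_relu_squared(v) for v in (-2.0, 0.0, 3.0)])  # → [1.0, 0.0, 9.0]
```

This follows the squared-ReLU recipe used in fast-training baselines, but keeps a leaky negative branch so gradients flow for negative pre-activations.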
Regularization
magnitude pruning
parameters: {"pattern":"2:4 structured sparsity","scope":"MLP weight matrices"}
magnitude pruning
parameters: {"pattern":"2:4 structured sparsity","scope":"MLP weight matrices","importance":"Hessian-guided"}
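Both pruning entries use the 2:4 pattern: in every group of 4 consecutive weights, the 2 least important are zeroed. Plain magnitude pruning scores by w²; the Hessian-guided variant reweights by a per-weight curvature estimate, e.g. the diagonal of the GPTQ Hessian. A minimal sketch covering both (names are illustrative):

```python
def prune_2_of_4(weights, diag_h=None):
    """Zero the 2 least-important weights in each consecutive group of 4."""
    assert len(weights) % 4 == 0
    out = list(weights)
    for g in range(0, len(weights), 4):
        # Importance: w^2, optionally scaled by a Hessian-diagonal estimate.
        score = lambda i: weights[i] ** 2 * (diag_h[i] if diag_h else 1.0)
        for i in sorted(range(g, g + 4), key=score)[:2]:  # two smallest
            out[i] = 0.0
    return out

w = [0.9, -0.1, 0.3, -0.8, 0.2, 0.7, -0.6, 0.05]
print(prune_2_of_4(w))  # → [0.9, 0.0, 0.0, -0.8, 0.0, 0.7, -0.6, 0.0]
```

The fixed 2:4 pattern is what makes the zeros cheap to encode, so the pruned matrices shrink the stored artifact rather than just the FLOP count.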
Other
other
Compressor-aware training with a differentiable proxy loss encouraging weights near quantization grid points.
parameters: {"cat_every":50,"cat_weight":0.001}
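The CAT proxy penalizes distance from each weight to its nearest quantization grid point, added to the task loss every cat_every steps with coefficient cat_weight. The exact proxy form is not spelled out in the record; squared distance to the nearest grid point is one differentiable-almost-everywhere choice, sketched here:

```python
def cat_proxy_loss(weights, scale):
    """Mean squared distance from each weight to its nearest grid point."""
    total = 0.0
    for w in weights:
        nearest = round(w / scale) * scale   # snap to the quantization grid
        total += (w - nearest) ** 2
    return total / len(weights)

def total_loss(task_loss, weights, scale, step, cat_every=50, cat_weight=0.001):
    """Add the CAT penalty only every `cat_every` steps (50 in this PR)."""
    if step % cat_every == 0:
        return task_loss + cat_weight * cat_proxy_loss(weights, scale)
    return task_loss

print(cat_proxy_loss([0.10, 0.14], scale=0.1))
```

Weights that sit near grid points quantize with less error, which is the sense in which training becomes "compressor-aware."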
other
Mixture of Experts with 4 experts per MLP and top-2 routing plus load-balancing loss.
parameters: {"experts":4,"top_k":2,"alpha":0.01}
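With 4 experts and top-2 routing, a router scores the experts per token, the top 2 are evaluated and mixed by their softmax weights, and an auxiliary load-balancing loss (Switch-Transformer style, coefficient alpha) pushes routing toward uniform expert usage. A toy sketch with scalar experts (the experts and logits are illustrative):

```python
import math

def top2_moe(x, router_logits, experts):
    """Mix the top-2 experts by their softmax weights over the selected logits."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i])[-2:]
    exps = [math.exp(router_logits[i]) for i in top]
    z = sum(exps)
    return sum(e / z * experts[i](x) for i, e in zip(top, exps))

def load_balance_loss(assign_counts, mean_probs, alpha=0.01):
    """Switch-style aux loss: n_experts * sum(token_fraction_i * mean_prob_i)."""
    n = len(assign_counts)
    total = sum(assign_counts)
    return alpha * n * sum(c / total * p for c, p in zip(assign_counts, mean_probs))

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: 0.5 * x]
print(top2_moe(1.0, [2.0, 2.0, -1.0, 0.0], experts))
```

Under a 16MB artifact cap, MoE trades parameter count for conditional compute, which is likely why it lands in the negative-results column here.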
other
KAN layers using spline-parameterized activations instead of standard MLPs.
parameters: {"grid_size":5,"order":3}
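A KAN layer replaces fixed MLP activations with learned splines on each edge. A minimal sketch of one such edge activation, assuming an order-3 B-spline over a uniform grid of size 5 on [-1, 1] via the Cox-de Boor recursion; the coefficients below are arbitrary placeholders for what a real KAN layer would learn per (input, output) pair:

```python
def bspline_basis(x, knots, i, k):
    """Cox-de Boor recursion for the i-th B-spline basis function of order k."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = ((x - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(x, knots, i, k - 1))
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(x, knots, i + 1, k - 1))
    return left + right

def spline_activation(x, coeffs, grid_size=5, order=3, lo=-1.0, hi=1.0):
    """Learned scalar activation: weighted sum of grid_size + order B-splines."""
    step = (hi - lo) / grid_size
    # Uniform knot vector, extended by `order` knots on each side.
    knots = [lo + (j - order) * step for j in range(grid_size + 2 * order + 1)]
    return sum(c * bspline_basis(x, knots, i, order) for i, c in enumerate(coeffs))

coeffs = [0.1 * i for i in range(5 + 3)]   # grid_size + order coefficients
print(spline_activation(0.0, coeffs))
```

Each scalar connection carries grid_size + order learned coefficients instead of one weight, which inflates parameter count per unit of capacity and plausibly explains the negative result under the size constraint.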
Test-Time Training
score-first TTT
parameters: null
Evaluation
sliding window eval
parameters: null
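Sliding-window evaluation scores a long sequence in overlapping windows so every token is evaluated once with substantial left context. A sketch of the window bookkeeping, assuming the first window scores all its tokens and each later window scores only its fresh suffix (the window/stride values are illustrative; the record does not state them):

```python
def sliding_window_spans(n_tokens, window=2048, stride=512):
    """Return (ctx_start, end, score_from) triples.

    Tokens [score_from, end) are scored; [ctx_start, score_from) is context only.
    """
    spans = [(0, min(window, n_tokens), 0)]
    pos = min(window, n_tokens)
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        spans.append((max(0, end - window), end, pos))
        pos = end
    return spans

print(sliding_window_spans(3000))
```

Each token is scored exactly once, so per-token losses aggregate directly into a val_bpb figure like the 1.3971 reported above.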
LR Schedule
warmdown
parameters: {"warmdown":0.72}
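A warmdown of 0.72 reads as: hold the LR constant, then decay linearly to zero over the final 72% of training, i.e. the trapezoidal schedule's down-ramp. Whether the runs also use a warmup ramp is not stated in the record. A minimal sketch:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown=0.72):
    """Constant LR, then linear decay to zero over the last `warmdown` fraction."""
    decay_start = total_steps * (1 - warmdown)   # decay begins here
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - decay_start)

print([round(warmdown_lr(s, 100, 1.0), 3) for s in (0, 28, 64, 100)])
```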
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"matrix_lr":0.022}
Weight Averaging
EMA
parameters: {"decay":0.9965}
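EMA weight averaging keeps a shadow copy of the weights, updated each step as shadow = decay * shadow + (1 - decay) * weight, and evaluates the shadow copy rather than the raw weights. A minimal sketch with the record's decay of 0.9965:

```python
def ema_update(shadow, weights, decay=0.9965):
    """One EMA step over a flat list of parameters."""
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0, 1.0]
weights = [1.0, 1.0]
for _ in range(3):          # three training steps with fixed weights
    shadow = ema_update(shadow, weights)
print(shadow)               # first entry approaches 1 - 0.9965**3
```

A decay of 0.9965 gives an averaging horizon of roughly 1/(1 - 0.9965) ≈ 286 steps.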
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Compressor-aware training (CAT) for compression-friendly weights
- 2:4 structured sparsity for artifact-size reduction
- Hessian-guided structured sparsity using GPTQ Hessians
- Mixture of Experts exploration under the 16MB constraint
- KAN exploration under the 16MB constraint
- Systematic negative-results comparison against a strong baseline