PR #1962

open

Record candidate: PR #1855 + Adaptive Hessian-Sensitivity GPTQ Clip — val_bpb 1.06310 (3-seed mean)

by chris-colinsky
val_bpb: 1.0631
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: matrix weights
mixed int5/int6/int7
bits: null
scope: matrix weights
Architecture
XSA
All 11 layers use the XSA attention modification.
parameters: {"layers":11}
SmearGate
BOS-fixed position-mixing gate with not_bos mask.
parameters: null
U-Net skip connections
Encoder-decoder skip connections with skip gates.
parameters: null
depth recurrence
Loops layers 3–5, running them 3 times once the fraction exceeds the threshold.
parameters: {"layers":[3,4,5],"repeats":3}
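As a rough illustration of the depth-recurrence entry above, the sketch below builds the order in which layers would execute once looping is active. It assumes 0-based layer indices, 11 layers in total, and that "3 times" means three total passes over the looped block; the function name `layer_schedule` and the `looping_active` flag are hypothetical and not taken from the PR.

```python
def layer_schedule(num_layers, loop_layers, repeats, looping_active):
    """Return the order in which layer indices are executed.

    Assumption: loop_layers is a contiguous, sorted run of 0-based indices.
    """
    if not looping_active:
        return list(range(num_layers))
    first, last = loop_layers[0], loop_layers[-1]
    before = list(range(first))                # layers ahead of the loop
    after = list(range(last + 1, num_layers))  # layers after the loop
    return before + list(loop_layers) * repeats + after


schedule = layer_schedule(11, [3, 4, 5], 3, looping_active=True)
print(schedule)
```

With these assumptions the looped model runs 17 layer applications per forward pass instead of 11, with no extra parameters.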
LeakyReLU
LeakyReLU-squared MLP activation.
parameters: {"slope":0.5}
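A minimal scalar sketch of a LeakyReLU-squared activation with the listed slope of 0.5. It assumes the square is applied to the LeakyReLU output, so negative inputs also produce non-negative values; the PR may define the activation differently, and `leaky_relu_squared` is a hypothetical name.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU followed by squaring (assumed ordering)."""
    y = x if x >= 0 else slope * x  # LeakyReLU with negative slope 0.5
    return y * y                    # elementwise square


print(leaky_relu_squared(2.0), leaky_relu_squared(-2.0))
```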
GQA
Grouped query attention with 2:1 grouping.
parameters: {"heads":8,"kv_heads":4}
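To make the 2:1 grouping concrete, the sketch below maps each of the 8 query heads to one of the 4 key/value heads. It assumes contiguous grouping (query heads 0 and 1 share KV head 0, and so on), which is a common convention but is not confirmed by the PR; `kv_head_for_query` is a hypothetical helper.

```python
def kv_head_for_query(q_head, heads=8, kv_heads=4):
    """Return the KV head index shared by a given query head."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return q_head // group_size


mapping = [kv_head_for_query(h) for h in range(8)]
print(mapping)
```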
RoPE
Partial RoPE with YaRN scaling.
parameters: {"dimensions":16,"total_dimensions":64}
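The following sketch shows what "partial" RoPE means for a single 64-dimensional head vector: only 16 dimensions are rotated and the remaining 48 pass through unchanged. Several details are assumptions made for illustration: the rotated dimensions are taken to be the first 16, pairs are formed by splitting those 16 in half, the frequency base is 10000, and YaRN scaling is omitted entirely.

```python
import math


def partial_rope(vec, position, rope_dims=16, base=10000.0):
    """Rotate the first rope_dims entries of vec; leave the rest untouched."""
    half = rope_dims // 2
    out = list(vec)
    for i in range(half):
        # One rotation frequency per (i, i + half) pair.
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


vec = [float(i) for i in range(64)]
rotated = partial_rope(vec, position=5)
```

Because each pair is only rotated, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.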
Regularization
logit softcap
parameters: {"value":30}
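A logit softcap of 30 is commonly implemented as a scaled tanh, which keeps logits inside (-30, 30) while staying close to the identity for small values. The listing gives only the value, so the tanh form below is an assumption, and `softcap` is a hypothetical name.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


print(softcap(1.0), softcap(100.0))
```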
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
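The scale rule above can be tabulated directly. The sketch assumes 0-based layer indices and the 11 layers mentioned in the XSA entry; where the factor is applied (for example, to the normalized activations) is not stated in the listing and is left out.

```python
import math


def ln_scale(layer):
    """Per-layer scale 1/sqrt(layer + 1), with layer counted from 0."""
    return 1.0 / math.sqrt(layer + 1)


scales = [ln_scale(i) for i in range(11)]
print([round(s, 3) for s in scales])
```

Under these assumptions the first layer is unscaled and the fourth layer (index 3) is scaled by one half.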
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"steps":5}
Adam
weight_decay: 0.5
momentum: null
other_params: {"beta1":0.9,"beta2":0.99}
Weight Averaging
EMA
parameters: {"decay":0.9965}
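A minimal sketch of the standard EMA update with the listed decay of 0.9965, operating on flat lists of floats. How often the PR applies the update, and to which tensors, is not stated; `ema_update` is a hypothetical helper.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0])
```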
Compression
custom
level: null
Test-Time Training
LoRA TTT
parameters: {"rank":56,"phases":3,"score_first":true}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85,"min_lr":0.1}
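One plausible reading of these parameters is that the learning rate stays at its peak for the first 15% of training and then decays over the final 85% to 0.1 of the peak. The sketch below encodes that reading; the linear decay shape and the interpretation of `min_lr` as a fraction of the peak are assumptions, and `lr_multiplier` is a hypothetical name.

```python
def lr_multiplier(progress, warmdown_frac=0.85, min_lr=0.1):
    """Multiplier on the peak LR at training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    start = 1.0 - warmdown_frac            # warmdown begins here
    if progress <= start:
        return 1.0
    t = (progress - start) / warmdown_frac  # 0 at start, 1 at the end
    return 1.0 - t * (1.0 - min_lr)         # linear decay to min_lr


multipliers = [lr_multiplier(p / 10) for p in range(11)]
```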
Other
other
Adaptive Hessian-sensitivity GPTQ clipping with per-tensor sigma selection from Hessian diagonal magnitude and weight row variance.
parameters: {"sigma_range":[6,24]}
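To show what a clip sigma controls, the sketch below quantizes one weight row to signed 6-bit codes with the clip threshold set to `clip_sigma` times the row's standard deviation. This is plain round-to-nearest, not GPTQ: the Hessian-based error compensation is omitted, and the rule "clip = sigma × row std" is an assumption about how the sigma is used. `quantize_row` is a hypothetical helper.

```python
import statistics


def quantize_row(row, clip_sigma, bits=6):
    """Symmetric quantization with a std-based clip threshold (assumed rule)."""
    qmax = 2 ** (bits - 1) - 1             # 31 for 6 bits
    clip = clip_sigma * statistics.pstdev(row)
    if clip == 0.0:
        return [0] * len(row), 1.0
    scale = clip / qmax                     # step size of the integer grid
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


row = [0.1, -0.2, 0.05, 0.3, -0.15]
codes, scale = quantize_row(row, clip_sigma=6.0)
dequantized = [c * scale for c in codes]
```

A larger sigma clips fewer outliers but widens the grid step, so choosing it per tensor trades clipping error against rounding error.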

Novel Contributions

  • Adaptive per-tensor Hessian-sensitivity GPTQ clipping replacing three hand-tuned clip sigmas
  • Preservation of the overall compression budget via a binary-searched offset that matches the prior log-average sigma
  • Demonstrated composability with LQER asymmetric quantization and phased TTT on the PR #1855 stack
  • Mixed-precision Hessian-based GPTQ ablation implemented and reported as a negative result
  • TTT_LORA_RANK reduced to 56 while maintaining phased score-first TTT
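The first two contributions can be sketched together: each tensor gets a sigma derived from a sensitivity score, clamped to the listed range [6, 24], and a shared offset is binary-searched so that the log-average (geometric mean) of the sigmas matches a target. Everything beyond that outline is an assumption for illustration: the scores here are made-up numbers standing in for whatever the PR computes from the Hessian diagonal and row variance, the exponential mapping and `gain` are hypothetical, and the target of 12.0 is arbitrary.

```python
import math

SIGMA_MIN, SIGMA_MAX = 6.0, 24.0  # sigma_range from the listing


def sigmas_for_offset(scores, offset, gain=1.0):
    """Map per-tensor scores to clamped sigmas (hypothetical mapping)."""
    return [min(SIGMA_MAX, max(SIGMA_MIN, math.exp(offset + gain * s)))
            for s in scores]


def log_average(sigmas):
    """Geometric mean, i.e. exp of the mean log sigma."""
    return math.exp(sum(math.log(s) for s in sigmas) / len(sigmas))


def match_log_average(scores, target_sigma, gain=1.0, iters=60):
    """Binary-search the offset so the log-average sigma hits the target."""
    if not SIGMA_MIN <= target_sigma <= SIGMA_MAX:
        raise ValueError("target must lie inside the sigma range")
    scaled = [gain * s for s in scores]
    lo = math.log(SIGMA_MIN) - max(scaled)  # every sigma clamps to the minimum
    hi = math.log(SIGMA_MAX) - min(scaled)  # every sigma clamps to the maximum
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if log_average(sigmas_for_offset(scores, mid, gain)) < target_sigma:
            lo = mid
        else:
            hi = mid
    return sigmas_for_offset(scores, 0.5 * (lo + hi), gain)


scores = [-1.0, -0.2, 0.0, 0.4, 1.5]  # made-up sensitivity scores
sigmas = match_log_average(scores, target_sigma=12.0)
print([round(s, 2) for s in sigmas])
```

A search is used rather than a closed-form shift because clamping makes the log-average a nonlinear, though still monotone, function of the offset.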