PR #1962

open

Record candidate: PR #1855 + Adaptive Hessian-Sensitivity GPTQ Clip — val_bpb 1.06310 (3-seed mean)

by chris-colinsky

val_bpb

1.0631

Architecture

Transformer

Optimizer

Muon

Artifact Size

~15.9 MB

Training Techniques

Quantization

GPTQ

bits: 6

scope: matrix weights

mixed int5/int6/int7

bits: null

scope: matrix weights

Architecture

XSA

All 11 layers use XSA attention modification.

parameters: {"layers":11}

SmearGate

BOS-fixed position-mixing gate with not_bos mask.

parameters: null

U-Net skip connections

Encoder-decoder skip connections with skip gates.

parameters: null

depth recurrence

Loop layers 3–5 and run them 3 times once fraction exceeds threshold.

parameters: {"layers":[3,4,5],"repeats":3}
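The execution order implied by these parameters can be sketched as follows. This is a minimal illustration, assuming 0-indexed layers, 11 layers in total, and a contiguous repeat of the looped block; the function name `layer_schedule` and the `looping_active` flag (standing in for the threshold condition) are hypothetical.

```python
def layer_schedule(num_layers=11, loop_layers=(3, 4, 5), repeats=3, looping_active=True):
    """Return the order in which layer indices run in one forward pass."""
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(0, first))            # layers before the loop run once
    times = repeats if looping_active else 1  # before the threshold, no repetition
    for _ in range(times):
        order.extend(loop_layers)             # looped block reuses the same weights
    order.extend(range(last + 1, num_layers)) # layers after the loop run once
    return order
```

With looping active, 11 stored layers yield 17 layer applications per forward pass, so effective depth grows without adding parameters.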

LeakyReLU

LeakyReLU-squared MLP activation.

parameters: {"slope":0.5}
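One plausible reading of this activation is sketched below: apply LeakyReLU with the stated slope, then square the result. Whether the PR squares the output exactly this way (for example, whether the sign of negative inputs is preserved) is not stated here, so treat it as an assumption.

```python
def leaky_relu_squared(x, slope=0.5):
    """Scalar LeakyReLU followed by squaring (assumed form)."""
    y = x if x >= 0.0 else slope * x  # LeakyReLU with negative slope 0.5
    return y * y                      # squaring makes the output non-negative
```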

GQA

Grouped query attention with 2:1 grouping.

parameters: {"heads":8,"kv_heads":4}
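With 8 query heads and 4 key/value heads, each key/value head serves two query heads. The mapping can be sketched as below, assuming contiguous grouping (adjacent query heads share a key/value head); the helper name `kv_head_for` is hypothetical.

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    """Map a query-head index to the key/value head it attends with."""
    assert heads % kv_heads == 0, "query heads must divide evenly into groups"
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return query_head // group_size
```

For the stated configuration this halves the key/value projection size and cache relative to full multi-head attention.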

RoPE

Partial RoPE with YaRN scaling.

parameters: {"dimensions":16,"total_dimensions":64}
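Partial RoPE rotates only a subset of each head's dimensions and passes the rest through unchanged. The sketch below rotates the first 16 of 64 dimensions; the choice of which dimensions are rotated, the half-split pairing, and the base of 10000 are assumptions, and YaRN scaling is deliberately left out.

```python
import math

def partial_rope(vec, position, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of vec by position; leave the rest as-is."""
    assert rot_dims % 2 == 0 and rot_dims <= len(vec)
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dims)  # per-pair frequency
        a, b = vec[i], vec[i + half]                       # half-split pairing (assumed)
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the 48 unrotated dimensions carry no positional signal.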

Regularization

logit softcap

parameters: {"value":30}
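A common way to implement a logit softcap is a scaled tanh, sketched below with the stated value of 30. The tanh form is an assumption; only the cap value is given here.

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the interval [-cap, cap]."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through nearly unchanged, while large ones saturate at the cap instead of growing without bound.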

LN scale

parameters: {"scale":"1/sqrt(layer+1)"}
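The scale formula can be evaluated per layer as in this small sketch, which assumes `layer` is a 0-indexed layer number; where the factor is applied inside the block is not specified here.

```python
import math

def ln_scale(layer):
    """Per-layer normalization scale 1/sqrt(layer + 1), with layer 0-indexed (assumed)."""
    return 1.0 / math.sqrt(layer + 1)
```

Under this indexing the first layer is unscaled and deeper layers are progressively damped.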

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"steps":5}

Adam

weight_decay: 0.5

momentum: null

other_params: {"beta1":0.9,"beta2":0.99}

Weight Averaging

EMA

parameters: {"decay":0.9965}
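An exponential moving average of the weights with the stated decay can be sketched as follows, using plain lists of floats in place of real parameter tensors; the function name `ema_update` is hypothetical.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend current weights into the running average, element by element."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 updates.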

Compression

custom

level: null

Test-Time Training

LoRA TTT

parameters: {"rank":56,"phases":3,"score_first":true}

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85,"min_lr":0.1}
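One way to read these parameters is sketched below: the learning rate holds at its base value for the first 15% of training, then decays over the final 85%. The linear decay shape and the interpretation of `min_lr` as a fraction of the base learning rate are both assumptions.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85, min_lr=0.1):
    """Multiplier on the base learning rate at a given step (assumed linear warmdown)."""
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step <= warmdown_start:
        return 1.0  # constant phase before warmdown begins
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return 1.0 - (1.0 - min_lr) * min(progress, 1.0)  # decay toward min_lr
```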

Other

other

Adaptive Hessian-sensitivity GPTQ clipping with per-tensor sigma selection from Hessian diagonal magnitude and weight row variance.

parameters: {"sigma_range":[6,24]}
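The budget-preserving step can be illustrated with the sketch below. It takes one precomputed sensitivity score per tensor as input, maps scores to clip sigmas through an exponential with a shared offset, clamps to [6, 24], and binary-searches the offset until the log-average (geometric mean) sigma matches a target. The exponential mapping, the `gain` parameter and its sign, and the way scores are derived from the Hessian diagonal and row variance are all hypothetical; only the sigma range and the binary-searched offset matching the prior log-average sigma come from this page.

```python
import math

def adaptive_sigmas(scores, target_sigma, gain=1.0, lo=6.0, hi=24.0, iters=60):
    """Pick per-tensor clip sigmas whose geometric mean matches target_sigma."""
    shifted = [gain * s for s in scores]

    def sigmas(offset):
        # Hypothetical mapping: exponential in the score, clamped to [lo, hi].
        return [min(hi, max(lo, math.exp(offset + t))) for t in shifted]

    def log_avg(offset):
        vals = sigmas(offset)
        return math.exp(sum(math.log(v) for v in vals) / len(vals))

    # Bracket: at a every sigma clamps to lo, at b every sigma clamps to hi.
    a = math.log(lo) - max(shifted)
    b = math.log(hi) - min(shifted)
    for _ in range(iters):  # log_avg is non-decreasing in offset
        mid = 0.5 * (a + b)
        if log_avg(mid) < target_sigma:
            a = mid
        else:
            b = mid
    return sigmas(0.5 * (a + b))
```

Matching the geometric mean rather than the arithmetic mean keeps the average log clip width, and so roughly the overall quantization budget, the same as the hand-tuned baseline while letting individual tensors move within the allowed range.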

Novel Contributions

Adaptive per-tensor Hessian-sensitivity GPTQ clipping replacing three hand-tuned clip sigmas
Preservation of the overall compression budget via binary-searched offset matching the prior log-average sigma
Demonstrated composability with LQER asymmetric quantization and phased TTT on the PR #1855 stack
Mixed-precision Hessian-based GPTQ ablation implemented and reported as a negative result
TTT_LORA_RANK reduced to 56 while maintaining phased score-first TTT