PR #1689
openSP8192 + Adaptive Hessian-Sensitivity GPTQ Clipping — 1.0822 bpb
by chris-colinsky
val_bpb: 1.0822
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.91 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
GPTQ
bits: 8
scope: embeddings
Architecture
depth recurrence
Reuses layers in a recurrent encoder/decoder pattern to create virtual layers from fewer physical layers.
parameters: {"layers":[3,4,5]}
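A minimal sketch of how a depth-recurrent schedule could turn the physical layers [3, 4, 5] into extra virtual layers; the repeat count and the exact encoder/decoder interleaving are assumptions, since the PR only states which layers are reused:

```python
def virtual_layer_schedule(n_physical, recurrent, repeats):
    """Expand physical layer indices into a virtual execution order by
    repeating the recurrent block `repeats` times (hypothetical sketch)."""
    sched = []
    for i in range(n_physical):
        if i == recurrent[0]:
            sched.extend(recurrent * repeats)  # unroll the recurrent block
        elif i in recurrent:
            continue  # already emitted as part of the unrolled block
        else:
            sched.append(i)
    return sched
```

With 8 physical layers and 2 passes over the block, `virtual_layer_schedule(8, [3, 4, 5], 2)` yields 11 virtual layers from 8 sets of weights.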
Gated Attention
Learnable per-head query scaling (QK-Gain).
parameters: {"qk_gain":5.25}
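Scaling each head's queries by a learnable gain scales that head's attention logits, which sharpens the softmax as the gain grows. A toy illustration with the reported value of 5.25 (the surrounding attention code is hypothetical):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attn_weights(scores, qk_gain=1.0):
    """Attention weights after QK-Gain: scaling queries by a per-head
    scalar is equivalent to scaling the logits by the same scalar."""
    return softmax([qk_gain * s for s in scores])
```

Compared to a gain of 1.0, a gain of 5.25 concentrates much more mass on the top-scoring key.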
Parallel Residuals
Attention and MLP branches read from the same pre-residual input in later layers.
parameters: {"start_layer":7}
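In a parallel-residual block (used here from layer 7 onward), the attention and MLP branches both read the same normalized pre-residual input instead of the MLP reading the attention output. A structural sketch with stand-in sublayers:

```python
def parallel_block(x, attn, mlp, norm):
    """Parallel residual: both branches see the same input h = norm(x),
    and their outputs are summed onto the residual stream."""
    h = norm(x)
    return x + attn(h) + mlp(h)
```

With scalar stand-ins (`attn = 2h`, `mlp = 3h`, identity norm), an input of 1.0 gives 1 + 2 + 3 = 6.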
LeakyReLU
Uses a squared LeakyReLU activation.
parameters: {"slope":0.5}
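One plausible reading of this activation, offered as a sketch: LeakyReLU with negative slope 0.5 followed by squaring. Whether the negative branch keeps its sign after squaring is not stated in the PR, so plain squaring is an assumption here:

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU followed by squaring (sketch; negative-branch sign
    handling is an assumption, not the PR's confirmed definition)."""
    y = x if x > 0 else slope * x
    return y * y
```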
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
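Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the remaining 48 through unchanged. A sketch using the conventional base frequency of 10000 (an assumption, as the PR does not state it):

```python
import math

def partial_rope(vec, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rope_dims` entries
    of a head vector; dimensions beyond that are left untouched."""
    out = list(vec)
    for i in range(0, rope_dims, 2):
        theta = pos * base ** (-i / rope_dims)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

At position 0 the rotation is the identity; at any position, entries 16..63 are returned unchanged.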
weight tying
Tied input and output embeddings.
parameters: null
U-Net skip connections
Skip connections with sigmoid gating.
parameters: null
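A sketch of a sigmoid-gated skip connection: the decoder-side activation receives a copy of the matching encoder-side activation scaled by a learnable gate passed through a sigmoid. The scalar-gate form is an assumption; the PR gives no parameters:

```python
import math

def gated_skip(decoder_x, encoder_x, gate_param):
    """U-Net style skip with sigmoid gating: out = dec + sigmoid(g) * enc.
    `gate_param` is a learnable scalar (hypothetical form)."""
    g = 1.0 / (1.0 + math.exp(-gate_param))
    return [d + g * e for d, e in zip(decoder_x, encoder_x)]
```

A large positive gate parameter saturates the sigmoid toward 1, recovering a plain additive skip.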
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5,"adamw_for":"embeddings/scalars"}
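Muon orthogonalizes each 2-D gradient matrix with a few Newton-Schulz iterations (5 here), while AdamW handles embeddings and scalars as noted above. A pure-Python sketch using the simple cubic iteration X <- 1.5*X - 0.5*X@X.T@X; production Muon uses a tuned quintic polynomial instead:

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize g: normalize by the Frobenius norm so
    all singular values lie in (0, 1], then iterate the cubic map, which
    drives every singular value toward 1."""
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        cube = matmul(xxt, x)  # X X^T X
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(xr, cr)]
             for xr, cr in zip(x, cube)]
    return x
```

On a diagonal matrix the iteration acts on each singular value independently, so diag(3, 1) converges toward diag(1, 1).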
Weight Averaging
EMA
parameters: {"decay":0.9965}
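EMA weight averaging with the reported decay of 0.9965: a shadow copy of the weights drifts slowly toward the live weights, and the shadow copy is what gets evaluated.

```python
def ema_update(shadow, weights, decay=0.9965):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]
```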
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: null
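Sliding-window evaluation slides a fixed context window over the sequence and, after the first window, scores only the tokens the previous window did not score, so every token is predicted with long left context. The window and stride values below are illustrative; the PR does not state them:

```python
def sliding_windows(n_tokens, window, stride):
    """Return (window_start, first_scored, window_end) triples: the first
    window scores all of its tokens, later windows only their last
    `stride` tokens (generic sketch of sliding-window eval)."""
    out = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        first_scored = start if start == 0 else end - stride
        out.append((start, first_scored, end))
        if end == n_tokens:
            break
        start += stride
    return out
```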
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
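A sketch of a warmdown schedule: hold the peak learning rate, then decay linearly to zero. Here `final_fraction` = 0.72 is read as the fraction of training spent at peak before the decay begins; that interpretation is an assumption, since the PR does not define the parameter:

```python
def warmdown_lr(step, total_steps, peak_lr, final_fraction=0.72):
    """Constant LR for the first final_fraction of training, then a
    linear warmdown to zero (interpretation of the parameter assumed)."""
    hold = int(final_fraction * total_steps)
    if step < hold:
        return peak_lr
    return peak_lr * (total_steps - step) / max(1, total_steps - hold)
```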
Regularization
logit softcap
parameters: {"value":30}
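Logit soft-capping squashes logits smoothly into (-cap, cap) via a scaled tanh, with cap = 30 here, so extreme logits are bounded while small logits pass through nearly unchanged:

```python
import math

def softcap(logit, cap=30.0):
    """Soft-cap a logit: cap * tanh(logit / cap). Approximately the
    identity for |logit| << cap, bounded by +/-cap for large inputs."""
    return cap * math.tanh(logit / cap)
```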
layerwise LN scale
parameters: null
Novel Contributions
- Per-tensor adaptive GPTQ clip_sigmas derived from Hessian sensitivity
- Binary-searched global offset that preserves the compression budget while clipping adapts per tensor
- Hessian-sensitivity-based clipping using H_diag and row variance
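The bullets above can be sketched as follows: a per-tensor sensitivity score from the Hessian diagonal and per-row weight variance maps to a clip_sigmas value, and a binary-searched global offset keeps the mean clip_sigmas at the budget so the overall compression ratio is preserved. The `base`/`scale` constants and the log1p mapping are illustrative assumptions, not the PR's actual code:

```python
import math

def tensor_sensitivity(h_diag, row_vars):
    """Hypothetical sensitivity: average of Hessian diagonal entries
    weighted by the variance of the corresponding weight rows."""
    return sum(h * v for h, v in zip(h_diag, row_vars)) / len(h_diag)

def adaptive_clip_sigmas(sensitivities, base=2.5, scale=0.5, budget=None):
    """Map each tensor's sensitivity to a clip_sigmas value (more
    sensitive tensors clip more gently), then binary-search a global
    offset so the mean stays at `budget`."""
    raw = [base + scale * math.log1p(s) for s in sensitivities]
    if budget is None:
        return raw
    lo, hi = -10.0, 10.0
    for _ in range(50):  # binary search the budget-preserving offset
        mid = (lo + hi) / 2
        mean = sum(r + mid for r in raw) / len(raw)
        if mean > budget:
            hi = mid
        else:
            lo = mid
    off = (lo + hi) / 2
    return [r + off for r in raw]
```

The adaptation is monotone in sensitivity, so the relative ordering of tensors is preserved while the mean matches the budget exactly.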