PR #1913 (open)

[Submission] Multi-Temperature AR GPTQ Calibration on SP8192 Stack — 1.0847 BPB

by Jeffrey-LeView on GitHub
val_bpb: 1.0847
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.04 MB

Training Techniques

Quantization
GPTQ
bits: null
scope: all
Architecture
depth recurrence
SP8192 stack uses repeated middle layers as part of the base architecture.
parameters: {"layers":11,"repeated_layers":"3-5","repeat_count":2}
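A minimal sketch of what these parameters imply, assuming zero-indexed layers and that repeat_count is the total number of passes over the weight-shared middle block (both conventions are assumptions, not stated in the submission):

```python
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Layers 3-5 of the 11-layer stack are re-run with shared weights,
    so effective depth exceeds parameter depth."""
    def __init__(self, layers, start=3, end=5, repeat_count=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.start, self.end, self.repeat_count = start, end, repeat_count

    def forward(self, x):
        for layer in self.layers[: self.start]:
            x = layer(x)
        for _ in range(self.repeat_count):  # weight-shared middle block
            for layer in self.layers[self.start : self.end + 1]:
                x = layer(x)
        for layer in self.layers[self.end + 1 :]:
            x = layer(x)
        return x
```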
XSA
XSA is used on all layers.
parameters: {"layers":11}
weight tying
Embedding/unembedding weight tying is implied by the canonical stack description.
parameters: null
U-Net skip connections
U-Net-style skip connections are used from layer 7 onward.
parameters: {"start_layer":7}
parallel residuals
Parallel residual pathway is enabled in later layers.
parameters: {"start_layer":7}
EMA
Exponential moving average is used with the base stack.
parameters: {"decay":0.9965}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"row_normalization":true}
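A hedged sketch of a Muon step with the row_normalization flag. The Newton-Schulz coefficients follow the public Muon reference code; the momentum default and the interpretation of row_normalization (unit-L2 rows of the orthogonalized update) are purely assumptions, since the submission lists momentum as null and does not define the flag:

```python
import torch

def newton_schulz(G, steps=5):
    # Odd-polynomial iteration that approximately orthogonalizes G
    # (coefficients from the public Muon reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, buf, lr, momentum=0.95, row_normalization=True):
    buf.mul_(momentum).add_(grad)   # momentum accumulation
    update = newton_schulz(buf)     # orthogonalized update direction
    if row_normalization:
        # ASSUMPTION: row_normalization rescales each row of the
        # update to unit L2 norm before it is applied.
        update = update / (update.norm(dim=1, keepdim=True) + 1e-7)
    param.add_(update, alpha=-lr)
```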
Test-Time Training
full TTT
parameters: {"learning_rate":0.005,"epochs":3,"schedule":"cosine decay"}
LR Schedule
cosine decay
parameters: null
Sequence Length
sequence_length
train_length: 512
eval_length: null
Evaluation
sliding window eval
parameters: null
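A sketch of sliding-window evaluation under assumed window/stride values (512/256; neither is given above): each window re-reads overlapping context, but only the newly covered tokens contribute to the reported bits per byte:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, ids, n_bytes, window=512, stride=256):
    # Score a long token sequence with a fixed context window, counting
    # loss only on positions not already scored by the previous window.
    total_nll = 0.0
    for start in range(0, ids.numel() - 1, stride):
        chunk = ids[start : start + window + 1]
        if chunk.numel() < 2:
            break
        logits = model(chunk[:-1].unsqueeze(0))[0]            # [T, vocab]
        losses = F.cross_entropy(logits, chunk[1:], reduction="none")
        skip = 0 if start == 0 else window - stride           # overlap with prior window
        total_nll += losses[skip:].sum().item()
    return total_nll / math.log(2) / n_bytes                  # nats -> bits per byte
```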
Compression
Brotli
level: 11
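Compressing the artifact at the listed quality level with the Python brotli bindings (the file name is illustrative):

```python
import brotli

with open("artifact.bin", "rb") as f:      # illustrative file name
    raw = f.read()
blob = brotli.compress(raw, quality=11)    # maximum Brotli quality
with open("artifact.bin.br", "wb") as f:
    f.write(blob)
print(f"{len(raw)} -> {len(blob)} bytes")
```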

Novel Contributions

  • Multi-temperature autoregressive calibration for GPTQ instead of single-temperature calibration (sketched after this list)
  • Weighted temperature mixture [0.5, 0.8, 1.1, 1.4] with counts [8, 24, 24, 8] for Hessian estimation
  • Shorter calibration sequence length of 512 to reduce generation time without hurting BPB
  • Ablation study showing the selected temperature spread outperforms single-temperature and wider-spread alternatives
  • Fully self-contained calibration with no external data post-training
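A minimal sketch of the calibration step, assuming the model samples its own 512-token sequences from a BOS token at each temperature and GPTQ then estimates its Hessians from the resulting batch (the BOS seed and the cache-free sampling loop are simplifications):

```python
import torch

@torch.no_grad()
def build_calibration_set(model, bos_id, temps=(0.5, 0.8, 1.1, 1.4),
                          counts=(8, 24, 24, 8), seq_len=512):
    # The to-be-quantized model generates its own calibration data, so no
    # external data is needed post-training. Re-forwarding the full prefix
    # each step (no KV cache) keeps the sketch short.
    sequences = []
    for temp, count in zip(temps, counts):
        for _ in range(count):
            ids = torch.tensor([[bos_id]])
            for _ in range(seq_len - 1):
                logits = model(ids)[0, -1]
                probs = torch.softmax(logits / temp, dim=-1)
                nxt = torch.multinomial(probs, 1).view(1, 1)
                ids = torch.cat([ids, nxt], dim=1)
            sequences.append(ids.squeeze(0))
    return torch.stack(sequences)  # 64 sequences of length seq_len
```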