PR #1913 (open)

[Submission] Multi-Temperature AR GPTQ Calibration on SP8192 Stack — 1.0847 BPB

by Jeffrey-LeView on GitHub
val_bpb: 1.0847
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.04 MB

Training Techniques

Quantization
GPTQ
bits: null
scope: all
Architecture
depth recurrence
SP8192 stack uses repeated middle layers as part of the base architecture.
parameters: {"layers":11,"repeated_layers":"3-5","repeat_count":2}
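A minimal sketch of what these parameters imply, assuming zero-indexed layers and that repeat_count is the total number of passes over the weight-shared middle block (both conventions are assumptions, not stated in the submission):

```python
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """Layers 3-5 of the 11-layer stack are re-run with shared weights,
    so effective depth exceeds parameter depth."""
    def __init__(self, layers, start=3, end=5, repeat_count=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.start, self.end, self.repeat_count = start, end, repeat_count

    def forward(self, x):
        for layer in self.layers[: self.start]:
            x = layer(x)
        for _ in range(self.repeat_count):  # weight-shared middle block
            for layer in self.layers[self.start : self.end + 1]:
                x = layer(x)
        for layer in self.layers[self.end + 1 :]:
            x = layer(x)
        return x
```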
XSA
XSA is used on all layers.
parameters: {"layers":11}
weight tying
Embedding/unembedding weight tying is implied by the canonical stack description.
parameters: null
U-Net skip connections
U-Net-style skip connections are used from layer 7 onward.
parameters: {"start_layer":7}
parallel residuals
Parallel residual pathway is enabled in later layers.
parameters: {"start_layer":7}
EMA
Exponential moving average is used with the base stack.
parameters: {"decay":0.9965}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"row_normalization":true}
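A hedged sketch of a Muon step with the row_normalization flag. The Newton-Schulz coefficients follow the public Muon reference code; the momentum default and the interpretation of row_normalization (unit-L2 rows of the orthogonalized update) are purely assumptions, since the submission lists momentum as null and does not define the flag:

```python
import torch

def newton_schulz(G, steps=5):
    # Odd-polynomial iteration that approximately orthogonalizes G
    # (coefficients from the public Muon reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, buf, lr, momentum=0.95, row_normalization=True):
    buf.mul_(momentum).add_(grad)   # momentum accumulation
    update = newton_schulz(buf)     # orthogonalized update direction
    if row_normalization:
        # ASSUMPTION: row_normalization rescales each row of the
        # update to unit L2 norm before it is applied.
        update = update / (update.norm(dim=1, keepdim=True) + 1e-7)
    param.add_(update, alpha=-lr)
```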
Test-Time Training
full TTT
parameters: {"learning_rate":0.005,"epochs":3,"schedule":"cosine decay"}
LR Schedule
cosine decay
parameters: null
Sequence Length
sequence_length
train_length: 512
eval_length: null
Evaluation
sliding window eval
parameters: null
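A sketch of sliding-window evaluation under assumed window/stride values (512/256; neither is given above): each window re-reads overlapping context, but only the newly covered tokens contribute to the reported bits per byte:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, ids, n_bytes, window=512, stride=256):
    # Score a long token sequence with a fixed context window, counting
    # loss only on positions not already scored by the previous window.
    total_nll = 0.0
    for start in range(0, ids.numel() - 1, stride):
        chunk = ids[start : start + window + 1]
        if chunk.numel() < 2:
            break
        logits = model(chunk[:-1].unsqueeze(0))[0]            # [T, vocab]
        losses = F.cross_entropy(logits, chunk[1:], reduction="none")
        skip = 0 if start == 0 else window - stride           # overlap with prior window
        total_nll += losses[skip:].sum().item()
    return total_nll / math.log(2) / n_bytes                  # nats -> bits per byte
```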
Compression
Brotli
level: 11
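Compressing the artifact at the listed quality level with the Python brotli bindings (the file name is illustrative):

```python
import brotli

with open("artifact.bin", "rb") as f:      # illustrative file name
    raw = f.read()
blob = brotli.compress(raw, quality=11)    # maximum Brotli quality
with open("artifact.bin.br", "wb") as f:
    f.write(blob)
print(f"{len(raw)} -> {len(blob)} bytes")
```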

Novel Contributions

  • Multi-temperature autoregressive calibration for GPTQ instead of single-temperature calibration (sketched after this list)
  • Weighted temperature mixture [0.5, 0.8, 1.1, 1.4] with counts [8, 24, 24, 8] for Hessian estimation
  • Shorter calibration sequence length of 512 to reduce generation time without hurting BPB
  • Ablation study showing the selected temperature spread outperforms single-temperature and wider-spread alternatives
  • Fully self-contained calibration with no external data post-training
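A minimal sketch of the calibration step, assuming the model samples its own 512-token sequences from a BOS token at each temperature and GPTQ then estimates its Hessians from the resulting batch (the BOS seed and the cache-free sampling loop are simplifications):

```python
import torch

@torch.no_grad()
def build_calibration_set(model, bos_id, temps=(0.5, 0.8, 1.1, 1.4),
                          counts=(8, 24, 24, 8), seq_len=512):
    # The to-be-quantized model generates its own calibration data, so no
    # external data is needed post-training. Re-forwarding the full prefix
    # each step (no KV cache) keeps the sketch short.
    sequences = []
    for temp, count in zip(temps, counts):
        for _ in range(count):
            ids = torch.tensor([[bos_id]])
            for _ in range(seq_len - 1):
                logits = model(ids)[0, -1]
                probs = torch.softmax(logits / temp, dim=-1)
                nxt = torch.multinomial(probs, 1).view(1, 1)
                ids = torch.cat([ids, nxt], dim=1)
            sequences.append(ids.squeeze(0))
    return torch.stack(sequences)  # 64 sequences of length seq_len
```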