PR #1389

open

Non-record: GPTQ-lite Scale Clamp Fix + 6-bit Packing + Depth Recurrence on Stack B

val_bpb: 1.7270
Architecture: Transformer
Optimizer: Muon
Artifact Size:

Training Techniques

Architecture
depth recurrence
Shared MLP weights across recurrent layers to reuse parameters.
parameters: {"layers":[4,5]}
LeakyReLU
Squared LeakyReLU activation (LeakyReLU(x)²) used in the MLP.
parameters: {"squared":true,"alpha":0.5}
Partial RoPE
Partial rotary positional embedding applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
XSA
XSA applied across all 11 layers.
parameters: {"layers":11}
Weight Averaging
EMA
parameters: {"decay":0.997}
Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Quantization
GPTQ-lite
bits: 6
scope: all int6 tensors
mixed int6/int8
bits: null
scope: shared layers int8, others int6
Compression
zstd
level: 22
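
The EMA weight averaging entry above (decay 0.997) amounts to the following update rule. This is a minimal sketch with illustrative names, not the submission's code, assuming parameters are held as plain dicts of floats:

```python
def ema_update(avg_params, new_params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * new.

    decay=0.997 matches the Weight Averaging entry; the dict
    representation is an assumption for illustration.
    """
    return {k: decay * v + (1.0 - decay) * new_params[k]
            for k, v in avg_params.items()}

avg = {"w": 1.0}
avg = ema_update(avg, {"w": 0.0})  # -> {"w": 0.997}
```

With a decay this close to 1, the averaged weights lag the live weights by roughly 1 / (1 - 0.997) ≈ 333 steps, which is the usual trade-off between smoothing and staleness.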

Novel Contributions

  • Fixed GPTQ-lite scale clamp for int6 quantization by changing the minimum scale clamp to 1e-7
  • Packed 4 int6 values into 3 bytes to reduce artifact size
  • Forced int8 quantization for the depth-recurrence shared layers, since the shared weights are reused at multiple depths and their quantization error compounds across recurrent applications
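
The scale clamp and the 4-into-3-byte packing described above can be sketched as follows. This is an illustrative NumPy reimplementation, not the PR's code: it assumes symmetric per-tensor quantization to the signed range [-31, 31], clamps the minimum scale at 1e-7 as in the fix, and packs four 6-bit codes (offset by 32 into 0..63) into one 24-bit word. All function names are hypothetical.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric int6 quantization with the minimum-scale clamp at 1e-7."""
    # Clamping at 1e-7 keeps near-zero tensors from producing a ~0 scale
    # (the value matches the clamp fix described above).
    scale = max(np.abs(w).max() / 31.0, 1e-7)
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def pack_int6(q: np.ndarray) -> bytes:
    """Pack four 6-bit codes into 3 bytes (24 bits)."""
    assert q.size % 4 == 0  # illustrative: real code would pad the tail
    u = (q.astype(np.int16) + 32).reshape(-1, 4)  # offset to 0..63
    out = bytearray()
    for a, b, c, d in u:
        word = (int(a) << 18) | (int(b) << 12) | (int(c) << 6) | int(d)
        out += word.to_bytes(3, "big")
    return bytes(out)

def unpack_int6(data: bytes) -> np.ndarray:
    """Inverse of pack_int6: 3 bytes -> four signed 6-bit codes."""
    codes = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        codes += [((word >> s) & 63) for s in (18, 12, 6, 0)]
    return np.array(codes, dtype=np.int16) - 32
```

Packing four codes per 3 bytes gives exactly 6 bits per value instead of the 8 a naive int8 container would spend, a 25% reduction on the int6 tensors.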