PR #1389
Non-record: GPTQ-lite Scale Clamp Fix + 6-bit Packing + Depth Recurrence on Stack B
by Rome-1
val_bpb
1.7270
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
depth recurrence
Shared MLP weights across recurrent layers to reuse parameters.
parameters: {"layers":[4,5]}
LeakyReLU
LeakyReLU squared activation used in the MLP.
parameters: {"squared":true,"alpha":0.5}
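A minimal sketch of the squared-LeakyReLU activation with the listed `alpha: 0.5`. The card does not say whether the negative branch keeps its sign after squaring, so this sketch squares the output directly; that detail is an assumption.

```python
def leaky_relu_squared(x: float, alpha: float = 0.5) -> float:
    # LeakyReLU with negative slope alpha (0.5 per the card), then squared.
    # Assumption: plain squaring, not a sign-preserving variant.
    y = x if x >= 0 else alpha * x
    return y * y
```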
Partial RoPE
Partial rotary positional embedding applied to a subset of dimensions.
parameters: {"dimensions":"16/64"}
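A sketch of partial RoPE under the assumption that the "16/64" parameter means the rotary embedding is applied to the first 16 of 64 head dimensions and the remaining 48 pass through unchanged; the split convention and the base of 10000 are assumptions, not stated in the card.

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Rotary embedding on the first `rot_dims` of the head dimension
    (the assumed reading of '16/64'); remaining dims are passed through.
    x: (seq_len, head_dim)."""
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # (half,)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```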
XSA
XSA applied across all layers.
parameters: {"layers":11}
Weight Averaging
EMA
parameters: {"decay":0.997}
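The EMA weight averaging above, with the listed decay of 0.997, amounts to the standard exponential update; a minimal sketch over a dict of parameters (the dict layout is illustrative, not from the PR):

```python
def ema_update(ema_params: dict, params: dict, decay: float = 0.997) -> dict:
    """One EMA step: ema <- decay * ema + (1 - decay) * current.
    decay=0.997 matches the card; the dict-of-floats layout is a stand-in
    for whatever tensor container the run actually uses."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```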
Optimizer
Muon
weight_decay: 0.085
momentum: null
other_params: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
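A sketch of the depth-dependent LayerNorm scale from the card, assuming `1/sqrt(layer+1)` multiplies the LN output so deeper layers contribute progressively smaller residual updates (where exactly the scale is applied is an assumption):

```python
import math

def ln_scale(layer_idx: int) -> float:
    # Depth-dependent LN scale per the card: 1/sqrt(layer + 1),
    # with layer_idx assumed 0-based.
    return 1.0 / math.sqrt(layer_idx + 1)
```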
Quantization
GPTQ-lite
bits: 6
scope: all int6 tensors
mixed int6/int8
bits: null
scope: shared layers int8, others int6
Compression
zstd
level: 22
Novel Contributions
- Fixed GPTQ-lite int6 quantization by clamping the minimum quantization scale to 1e-7
- Packed 4 int6 values into 3 bytes to reduce artifact size
- Forced int8 quantization for depth-recurrence shared layers to reduce compounded quantization error
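The first two contributions can be sketched together: a symmetric int6 quantizer whose scale is clamped at 1e-7, and a packer that stores 4 int6 values (24 bits) in 3 bytes. This is a minimal reconstruction, not the PR's actual code; symmetric per-tensor quantization and the +32 unsigned offset are assumptions.

```python
import numpy as np

MIN_SCALE = 1e-7  # the clamp value from the fix; prevents a degenerate near-zero scale

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor int6 quantization (range [-32, 31]) with the
    clamped scale. Per-tensor symmetric layout is an assumption."""
    scale = max(float(np.abs(w).max()) / 31.0, MIN_SCALE)
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def pack_int6(q: np.ndarray) -> np.ndarray:
    """Pack 4 int6 values into 3 bytes. Length must be a multiple of 4;
    values are offset by +32 into unsigned 6-bit fields."""
    u = (q.astype(np.int16) + 32).astype(np.uint32).reshape(-1, 4)  # 0..63
    word = (u[:, 0] << 18) | (u[:, 1] << 12) | (u[:, 2] << 6) | u[:, 3]
    out = np.empty((u.shape[0], 3), dtype=np.uint8)
    out[:, 0] = (word >> 16) & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = word & 0xFF
    return out.reshape(-1)

def unpack_int6(b: np.ndarray) -> np.ndarray:
    """Inverse of pack_int6."""
    b = b.reshape(-1, 3).astype(np.uint32)
    word = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    u = np.stack([(word >> 18) & 63, (word >> 12) & 63,
                  (word >> 6) & 63, word & 63], axis=1)
    return (u.reshape(-1).astype(np.int16) - 32).astype(np.int8)
```

The packing gives the 25% size reduction over byte-aligned int6 storage that the second bullet refers to (3 bytes instead of 4 per group of 4 values).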