val_bpb: 1.1711
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.84 MB
Training Techniques
Architecture
- MLP3x: 11-layer model with 3x MLP width (1536 hidden).
  parameters: {"layers": 11, "mlp_multiplier": 3, "hidden_size": 1536}
- LeakyReLU: squared LeakyReLU activation.
  parameters: {"variant": "squared"}
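A minimal NumPy sketch of the squared-LeakyReLU activation. The slope value and the sign-preserving form (sign(y)·y²) are assumptions; the card only says "squared".

```python
import numpy as np

def leaky_relu_squared(x, slope=0.01):
    # LeakyReLU followed by squaring. sign(y) * y**2 keeps the sign of the
    # negative branch; whether the sign is preserved is an assumption.
    y = np.where(x >= 0.0, x, slope * x)
    return np.sign(y) * y * y
```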
- Partial RoPE: rotary positional embeddings applied to a subset of head dimensions (16 of 64).
  parameters: {"dimensions": 16, "total_dimensions": 64}
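A sketch of partial RoPE in NumPy: only the first 16 of 64 head channels are rotated, the rest pass through unchanged. The pairing scheme and frequency base are standard-RoPE assumptions, not taken from the card.

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first rot_dims channels of x.

    x: (seq, head_dim); channels beyond rot_dims are left untouched.
    """
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # (half,)
    ang = positions[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]      # the rotated pair
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)
```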
- XSA: applied to all 11 layers.
  parameters: {"layers": 11}
- SmearGate: SmearGate mechanism added to the model.
  parameters: null
- BigramHash: hashed bigram embedding with a 3072-entry table of dimension 112.
  parameters: {"vocab_size": 3072, "dimension": 112}
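A sketch of a hashed bigram embedding: each (previous, current) token pair is hashed into a 3072-entry table of 112-dim vectors. The mixing constant and the leading BOS-style 0 for the first position are illustrative assumptions; only the table shape comes from the card.

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab_size=3072, mult=1000003):
    """Look up a hashed embedding for each (prev, cur) token bigram.

    table: (vocab_size, dim) embedding matrix, e.g. dim=112 as in the card.
    """
    prev = np.concatenate([[0], tokens[:-1]])      # shift right; 0 = assumed BOS
    idx = (prev * mult + tokens) % vocab_size      # cheap bigram hash (assumption)
    return table[idx]                              # (len(tokens), dim)
```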
Regularization
- LN scale
  parameters: {"scale": "1/sqrt(layer+1)"}
- Weight decay
  parameters: {"value": 0.04}
- Gradient clipping
  parameters: {"clip_norm": 0.3}
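A sketch of gradient clipping by global L2 norm. Whether the card's clip_norm=0.3 is a global-norm clip (as here) or per-parameter is an assumption.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm=0.3):
    """Scale all gradients so their joint L2 norm is at most clip_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, clip_norm / (total + 1e-12))
    return [g * scale for g in grads], total
```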
Weight Averaging
- EMA (exponential moving average of weights)
  parameters: {"decay": 0.997}
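A minimal sketch of the EMA weight average with the card's decay of 0.997: shadow weights are updated each step and would be the copy used for evaluation. Omitting bias correction is an assumption.

```python
import numpy as np

class EmaWeights:
    """ema <- decay * ema + (1 - decay) * w after each optimizer step."""

    def __init__(self, weights, decay=0.997):
        self.decay = decay
        self.shadow = [w.copy() for w in weights]

    def update(self, weights):
        d = self.decay
        for s, w in zip(self.shadow, weights):
            s *= d
            s += (1.0 - d) * w
```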
Quantization
- Late QAT
  bits: null, scope: all
- GPTQ
  bits: 7, scope: all
Compression
- LZMA
  level: 9
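A sketch of the artifact step using Python's standard-library lzma at preset 9. The use of pickle for serialization is an illustrative assumption; only LZMA level 9 comes from the card.

```python
import lzma
import pickle

def pack_artifact(arrays, preset=9):
    """Serialize quantized weight arrays and compress with LZMA preset 9."""
    return lzma.compress(pickle.dumps(arrays), preset=preset)

def unpack_artifact(blob):
    """Invert pack_artifact: decompress, then deserialize."""
    return pickle.loads(lzma.decompress(blob))
```

Quantized int weights compress well under LZMA because low-bit values repeat heavily; preset 9 trades compression time for the smallest artifact.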
Evaluation
- Sliding window eval
  parameters: {"stride": 64}
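A sketch of how sliding-window evaluation spans could be generated: the window advances by the card's stride of 64, each token is scored exactly once, and earlier tokens in the window serve only as context. The window size of 1024 is an assumption.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans for sliding-window eval.

    Each window covers tokens [start, end); only tokens in
    [score_from, end) contribute to the loss.
    """
    spans = []
    scored = 0   # number of tokens already scored
    start = 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored))
        scored = end
        start += stride
    return spans
```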
Optimizer
- Muon
  weight_decay: 0.04, momentum: null, other_params: null
LR Schedule
- Warmdown
  parameters: {"warmdown_steps": 3500}
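A sketch of the schedule, reading "warmdown" as a trailing linear decay (a common convention, but an assumption here); the base LR and total step count are illustrative, while warmdown_steps=3500 comes from the card.

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```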
Other
- Full Hessian-based GPTQ with Cholesky error feedback, collected via forward hooks on CastedLinear layers.
  parameters: null
- Optional depth recurrence that reruns the 11 physical layers multiple times with fresh U-Net skip connections.
  parameters: {"recurrence": 2}
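A sketch of the depth recurrence: the same physical layer stack is run `recurrence` times, and each pass starts a fresh U-Net skip stack (first half of the layers push activations, second half pop and add them). The exact skip pairing is an assumption.

```python
def recurrent_forward(x, layers, recurrence=2):
    """Rerun the layer stack `recurrence` times with fresh skips per pass."""
    n = len(layers)
    for _ in range(recurrence):
        skips = []                      # reset ("fresh") each pass
        for i, layer in enumerate(layers):
            if i < n // 2:
                skips.append(x)         # encoder half: push
            elif skips:
                x = x + skips.pop()     # decoder half: pop and add
            x = layer(x)
    return x
```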
Novel Contributions
- Full Hessian-based GPTQ post-training quantization
- Int7 quantization with LZMA-9 artifact compression
- BigramHash with 3072 vocabulary and 112 dimensions
- XSA applied across all layers
- SmearGate integration
- Partial RoPE and LN scale modifications
- EMA weight averaging with decay 0.997
- Late QAT and sliding window evaluation