PR #1378

open

Non Record: GPTQ int7 XSA BigramHash — val_bpb 1.1711

by Rajat123456789
val_bpb
1.1711
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.84 MB

Training Techniques

Architecture
MLP3x
11-layer model with 3x MLP width (1536 hidden).
parameters: {"layers":11,"mlp_multiplier":3,"hidden_size":1536}
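The MLP3x parameters imply a feed-forward width of 1536 × 3 = 4608. A minimal numpy sketch of the shapes, assuming a standard bias-free two-matmul MLP (the activation here is a plain ReLU placeholder; the card's actual activation is listed separately below):

```python
import numpy as np

HIDDEN, MULT = 1536, 3  # from the card: hidden_size 1536, mlp_multiplier 3

def mlp_block(x, w_in, w_out):
    # Feed-forward block with a 3x-wide intermediate: 1536 -> 4608 -> 1536.
    # ReLU is a placeholder; the submission uses LeakyReLU squared.
    h = np.maximum(x @ w_in, 0.0)   # (..., 1536) -> (..., 4608)
    return h @ w_out                # (..., 4608) -> (..., 1536)

rng = np.random.default_rng(0)
w_in = rng.standard_normal((HIDDEN, HIDDEN * MULT)) * 0.02
w_out = rng.standard_normal((HIDDEN * MULT, HIDDEN)) * 0.02
y = mlp_block(rng.standard_normal((2, HIDDEN)), w_in, w_out)
```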
LeakyReLU
LeakyReLU squared activation.
parameters: {"variant":"squared"}
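A hedged sketch of the squared variant, assuming the plain reading "apply LeakyReLU, then square" with the common 0.01 negative slope (neither the slope nor sign handling after squaring is stated on the card):

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.01):
    # LeakyReLU followed by elementwise squaring. Slope 0.01 and the
    # non-sign-preserving square are assumptions, not taken from the card.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

out = leaky_relu_squared(np.array([-2.0, 0.0, 3.0]))
```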
Partial RoPE
Partial rotary positional embeddings applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
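With 16 of 64 head dimensions rotated, the remaining 48 pass through unchanged. A numpy sketch for one head vector, assuming the first-half/second-half pairing convention for the rotary slice (the pairing and base are assumptions):

```python
import numpy as np

ROT, HEAD = 16, 64  # rotary applied to 16 of 64 head dims (from the card)

def partial_rope(x, pos, base=10000.0):
    # Rotate the first ROT dims of a head vector by position-dependent
    # angles; leave the remaining HEAD - ROT dims untouched.
    half = ROT // 2
    freqs = pos / base ** (np.arange(half) / half)   # (half,)
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2, rest = x[:half], x[half:ROT], x[ROT:]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return np.concatenate([rot, rest])

v = np.arange(HEAD, dtype=np.float64)
out = partial_rope(v, pos=5)
```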
XSA
XSA applied to all layers.
parameters: {"layers":11}
SmearGate
SmearGate mechanism added to the model.
parameters: null
BigramHash
BigramHash embedding with the specified vocabulary size and embedding dimension.
parameters: {"vocab_size":3072,"dimension":112}
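A sketch of a hashed-bigram embedding with the card's 3072 buckets and 112 dimensions: each (previous, current) token pair hashes to a bucket whose row is looked up as an extra embedding. The multiplicative hash is an assumption; the submission's exact mixing function is not described:

```python
import numpy as np

VOCAB_HASH, DIM = 3072, 112  # from the card

# Embedding table indexed by hashed bigram; init scale is an assumption.
table = np.random.default_rng(0).standard_normal((VOCAB_HASH, DIM)) * 0.02

def bigram_embed(tokens):
    # Pair each token with its predecessor (position 0 pairs with itself
    # here; the real boundary handling is unspecified), hash the pair with
    # a prime multiplier, and gather rows from the table.
    prev = np.concatenate([[tokens[0]], tokens[:-1]])
    idx = (prev * 1000003 + tokens) % VOCAB_HASH
    return table[idx]                                # (seq, 112)

emb = bigram_embed(np.array([5, 17, 17, 9]))
```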
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
weight decay
parameters: {"value":0.04}
gradient clipping
parameters: {"clip_norm":0.3}
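The LN scale entry above shrinks normalization output by depth. A sketch assuming the 1/sqrt(layer+1) factor multiplies the LayerNorm output (whether it replaces or multiplies a learned gain is not stated; weight decay 0.04 and clip norm 0.3 live in the optimizer and are not shown here):

```python
import numpy as np

def scaled_layernorm(x, layer, eps=1e-5):
    # Standard LayerNorm over the last axis, then scale by 1/sqrt(layer+1)
    # per the card's "scale": "1/sqrt(layer+1)".
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    y = (x - mu) / np.sqrt(var + eps)
    return y / np.sqrt(layer + 1)

x = np.random.default_rng(1).standard_normal(8)
y0 = scaled_layernorm(x, layer=0)   # scale 1
y3 = scaled_layernorm(x, layer=3)   # scale 1/2
```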
Weight Averaging
EMA
parameters: {"decay":0.997}
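The EMA update with decay 0.997 is the standard shadow-parameter recurrence; a plain-Python sketch over a dict of floats (a real run would track tensors):

```python
class EMA:
    # Exponential moving average of parameters: shadow <- d*shadow + (1-d)*p,
    # with decay d = 0.997 from the card.
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v

ema = EMA({"w": 0.0})
for _ in range(3):
    ema.update({"w": 1.0})
```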
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 7
scope: all
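Int7 means the quantized grid is the signed range [-63, 63]. A sketch of the grid alone, assuming symmetric per-output-channel scales and round-to-nearest; the submission's full GPTQ additionally uses Hessian-based error feedback (see the Other section):

```python
import numpy as np

def quantize_int7(w):
    # Symmetric per-row 7-bit quantization: each row's max magnitude maps
    # to 63; values snap to integers in [-63, 63].
    scale = np.abs(w).max(axis=1, keepdims=True) / 63.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(w / scale), -63, 63).astype(np.int8)
    return q, scale

w = np.random.default_rng(2).standard_normal((4, 8))
q, s = quantize_int7(w)
w_hat = q * s   # dequantized reconstruction
```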
Compression
lzma
level: 9
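Packing the quantized weights through LZMA at preset 9 is a direct use of the stdlib module; a minimal sketch with a toy int8 buffer standing in for the artifact:

```python
import lzma
import numpy as np

def compress_artifact(arr):
    # Compress raw weight bytes with LZMA at preset 9 (card: lzma, level 9).
    return lzma.compress(arr.tobytes(), preset=9)

q = np.zeros(10000, dtype=np.int8)  # toy stand-in for int7 weights
blob = compress_artifact(q)
```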
Evaluation
sliding window eval
parameters: {"stride":64}
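Sliding-window eval with stride 64 scores each token with near-full left context: windows advance so that only the trailing 64 tokens of each window count toward the loss. A sketch of the index bookkeeping (the window size 128 here is an illustrative assumption; only the stride comes from the card):

```python
def sliding_windows(seq_len, window, stride=64):
    # Yield (start, end, score_from) spans: each window scores tokens in
    # [score_from, end), so every token is scored exactly once.
    out = []
    start = 0
    while True:
        end = min(start + window, seq_len)
        score_from = end - stride if start > 0 else start
        out.append((start, end, max(score_from, start)))
        if end == seq_len:
            break
        start = end - (window - stride)
    return out

spans = sliding_windows(seq_len=256, window=128, stride=64)
```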
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
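A warmdown schedule holds the base LR and then decays it over the final 3500 steps; a sketch assuming a linear decay to zero (the decay shape and final value are assumptions, only warmdown_steps is from the card):

```python
WARMDOWN_STEPS = 3500  # from the card

def lr_at(step, total_steps, base_lr=1.0):
    # Constant LR, then linear decay to 0 over the last WARMDOWN_STEPS.
    start = total_steps - WARMDOWN_STEPS
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / WARMDOWN_STEPS

lrs = [lr_at(s, total_steps=10000) for s in (0, 6500, 8250, 10000)]
```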
Other
other
Full Hessian-based GPTQ with Cholesky error feedback collected via forward hooks on CastedLinear layers.
parameters: null
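The Hessian collection step above can be sketched in numpy: a hook accumulates H = Σ x xᵀ over layer inputs, and the Cholesky factor of the damped H drives GPTQ's column-wise error feedback. This is a stand-in only; the hook wiring on CastedLinear, the damping value, and the quantization loop itself are not shown and are assumptions:

```python
import numpy as np

class HessianCollector:
    # Accumulates the GPTQ input Hessian for one linear layer, as a
    # forward hook would; cholesky() returns the damped factor L (L L^T).
    def __init__(self, in_features, damping=0.01):
        self.H = np.zeros((in_features, in_features))
        self.damping = damping

    def __call__(self, x):           # x: (batch, in_features)
        self.H += x.T @ x

    def cholesky(self):
        d = self.damping * np.trace(self.H) / len(self.H)
        return np.linalg.cholesky(self.H + d * np.eye(len(self.H)))

rng = np.random.default_rng(3)
col = HessianCollector(16)
for _ in range(8):
    col(rng.standard_normal((32, 16)))   # simulate 8 hooked batches
L = col.cholesky()
```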
other
Optional depth recurrence that reruns the 11 physical layers multiple times with fresh U-Net skip connections.
parameters: {"recurrence":2}
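The depth-recurrence entry above can be sketched as rerunning the 11 physical layers twice, rebuilding U-Net skips on each pass (first-half outputs saved, added back to mirrored second-half layers). The exact skip pairing is an assumption; the toy affine "layers" stand in for transformer blocks:

```python
import numpy as np

def recurrent_forward(x, layers, recurrence=2):
    # Run the same physical layers `recurrence` times; skips are rebuilt
    # fresh each pass, so no state leaks between passes.
    n = len(layers)
    for _ in range(recurrence):
        skips = []
        for i, f in enumerate(layers):
            if i < n // 2:
                x = f(x)
                skips.append(x)          # save first-half output
            else:
                if skips:
                    x = x + skips.pop()  # mirrored U-Net skip
                x = f(x)
    return x

layers = [(lambda x, k=k: 0.9 * x + k * 0.01) for k in range(11)]
out = recurrent_forward(np.ones(4), layers, recurrence=2)
```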

Novel Contributions

  • Full Hessian-based GPTQ post-training quantization
  • Int7 quantization with LZMA-9 artifact compression
  • BigramHash with 3072 vocabulary and 112 dimensions
  • XSA applied across all layers
  • SmearGate integration
  • Partial RoPE and LN scale modifications
  • EMA weight averaging with decay 0.997
  • Late QAT and sliding window evaluation