PR #1473 (open)
Non-record: 11L FullGPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11564 (1-seed)
by AVINASH0052
val_bpb: 1.1156
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15,832,508 bytes
Training Techniques
Architecture
XSA
XSA applied to all layers
parameters: {"layers":11}
BigramHash
Bigram hash embedding
parameters: {"dimensions":[3072,112]}
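The exact hashing scheme is not given in the submission; a minimal sketch, assuming 3072 is the number of hash buckets and 112 the embedding width, with an illustrative (prev, cur) token-pair mixing hash:

```python
import numpy as np

N_BUCKETS, DIM = 3072, 112  # from the reported 3072×112 shape

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((N_BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Illustrative mixing hash; the submission's actual hash is not specified.
    return ((prev_tok * 1000003) ^ tok) % N_BUCKETS

def bigram_embed(token_ids):
    # Each position t looks up the embedding for the (t-1, t) token pair;
    # position 0 is paired with a padding id of 0 (an assumption).
    out = np.empty((len(token_ids), DIM), dtype=np.float32)
    prev = 0
    for t, tok in enumerate(token_ids):
        out[t] = bigram_table[bigram_bucket(prev, tok)]
        prev = tok
    return out

emb = bigram_embed([5, 17, 17, 2])
print(emb.shape)  # (4, 112)
```

In practice this would be added to (or concatenated with) the unigram token embedding before the first layer.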
LeakyReLU
LeakyReLU squared MLP activation
parameters: {"squared":true}
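A plausible reading, by analogy with the ReLU² activation common in speedrun MLPs, is elementwise squaring of the LeakyReLU output; the submission's exact formulation is not specified:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # The negative slope value is an assumption; it is not reported.
    return np.where(x >= 0, x, slope * x)

def leaky_relu_squared(x, slope=0.01):
    # Plain squaring of the LeakyReLU output; note this maps negative
    # inputs to small positive values.
    y = leaky_relu(x, slope)
    return y * y

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu_squared(x))  # [4.e-04 0.e+00 9.e+00]
```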
MLP3x
3x width MLP
parameters: {"width_multiplier":3}
GQA
Grouped query attention
parameters: {"heads":8,"kv_heads":4}
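With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads, halving the KV cache. A minimal numpy sketch (scaling and causal masking are standard assumptions, not taken from the submission):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: q is (T, n_heads, d); k, v are (T, n_kv_heads, d)."""
    group = n_heads // n_kv_heads          # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)        # broadcast KV heads to (T, n_heads, d)
    v = np.repeat(v, group, axis=1)
    T, _, d = q.shape
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum("hts,shd->thd", w, v)

rng = np.random.default_rng(0)
T, d = 5, 16
out = gqa_attention(rng.standard_normal((T, 8, d)),
                    rng.standard_normal((T, 4, d)),
                    rng.standard_normal((T, 4, d)))
print(out.shape)  # (5, 8, 16)
```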
Partial RoPE
RoPE applied to a subset of dimensions
parameters: {"dimensions":16,"total_dimensions":64}
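Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the remaining 48 through unchanged. A sketch assuming the split-half pairing convention (interleaved pairing is the other common choice, and which one the submission uses is not stated):

```python
import numpy as np

def partial_rope(x, rotate_dims=16, base=10000.0):
    """Apply RoPE to the first `rotate_dims` of each head vector; x is (T, d)."""
    T, d = x.shape
    half = rotate_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(T)[:, None] * inv_freq[None, :]   # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rotate_dims]      # paired halves
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rotate_dims:]], axis=-1)

x = np.ones((3, 64))
y = partial_rope(x)
print(y.shape)                            # (3, 64)
print(np.allclose(y[:, 16:], x[:, 16:]))  # True: unrotated dims pass through
```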
U-Net skip connections
Skip connections linking early and late layers in a U-Net style
parameters: {"pairs":[[0,10],[1,9],[2,8]]}
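The reported pairs connect early-layer outputs to late-layer inputs (0→10, 1→9, 2→8). A toy forward pass; whether the skip carries the layer's input or output, and whether it is gated or weighted, is not specified, so plain addition of the early layer's output is assumed:

```python
import numpy as np

SKIP_PAIRS = {10: 0, 9: 1, 8: 2}   # destination layer -> source layer

def forward(x, layers):
    saved = {}
    for i, layer in enumerate(layers):
        if i in SKIP_PAIRS:
            x = x + saved[SKIP_PAIRS[i]]   # U-Net style long skip
        x = layer(x)
        if i in SKIP_PAIRS.values():
            saved[i] = x                   # stash early-layer output
    return x

# Toy stand-in layers: each just scales its input.
layers = [lambda x, s=s: s * x for s in np.linspace(1.0, 1.1, 11)]
out = forward(np.ones(4), layers)
print(out.shape)  # (4,)
```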
VE128
Value embeddings on later layers
parameters: {"layers":[9,10]}
SmearGate
Input smearing gate on embeddings
parameters: null
weight scaling
Shared weight scales across layers
parameters: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(L+1)"}
weight decay
parameters: {"value":0.04}
Weight Averaging
EMA
parameters: {"decay":0.997}
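With the reported decay of 0.997, the EMA shadow weights move 0.3% of the way toward the current weights after each optimizer step:

```python
# Minimal EMA of model weights: after each optimizer step,
# shadow <- 0.997 * shadow + 0.003 * current.
def ema_update(shadow, params, decay=0.997):
    for name, p in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * p
    return shadow

params = {"w": 1.0}
shadow = dict(params)
for step in range(3):
    params["w"] += 0.1          # pretend the optimizer moved the weight
    shadow = ema_update(shadow, params)
print(round(shadow["w"], 6))    # lags well behind params["w"] = 1.3
```

Evaluation then uses the shadow weights rather than the raw ones.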
Tight SWA
parameters: {"start_step":6150}
Quantization
late QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
int6
bits: 6
scope: all
Compression
lzma
level: 9
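LZMA at preset 9 is the final artifact-size lever, applied to the serialized (quantized) weights. With Python's standard library:

```python
import lzma

# Stand-in bytes; in the real pipeline this would be the packed int6 weights.
payload = b"model weights would go here" * 100
packed = lzma.compress(payload, preset=9)   # preset 9 = max compression
print(len(packed) < len(payload))           # True
restored = lzma.decompress(packed)
print(restored == payload)                  # True: lossless round-trip
```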
Evaluation
sliding window eval
parameters: {"stride":64}
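A stride-64 sliding window evaluation advances the context window 64 tokens at a time and scores only the newly exposed tokens, so each token is evaluated with close to the longest available left context. A hypothetical index generator, assuming the 1024-token training context is also used at eval time (eval_length is reported as null):

```python
# Each span is (window_start, window_end, score_from): the model sees tokens
# [start, end) and only tokens [score_from, end) contribute to the loss.
def sliding_windows(n_tokens, ctx=1024, stride=64):
    spans = []
    for end in range(min(ctx, n_tokens), n_tokens + 1, stride):
        start = max(0, end - ctx)
        score_from = end - stride if spans else start  # first window scores everything
        spans.append((start, end, score_from))
    return spans

spans = sliding_windows(2048)
print(spans[0])   # (0, 1024, 0)
print(spans[1])   # (64, 1088, 1024)
print(spans[-1])  # (1024, 2048, 1984)
```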
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"multi_gpu":true}
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
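A warmdown schedule holds the base learning rate and then decays it linearly to zero over the final 3,500 steps. A sketch; the total step count and base LR below are placeholders, since neither is reported:

```python
# Hold base_lr, then decay linearly to zero over the last `warmdown_steps`.
def lr_at(step, total_steps, base_lr, warmdown_steps=3500):
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps

total = 8000                            # assumed total; not reported
print(lr_at(0, total, 0.02))            # 0.02 (full LR before warmdown)
print(lr_at(total - 1750, total, 0.02)) # 0.01 (halfway through warmdown)
print(lr_at(total, total, 0.02))        # 0.0
```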
Other
other
Full Hessian GPTQ calibration using autoregressive self-generated sequences
parameters: {"calibration_seqs":64,"calibration_tokens":2048}
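GPTQ quantizes each linear layer column-by-column using second-order information from a Hessian proxy H ∝ Σᵢ xᵢxᵢᵀ accumulated over calibration activations; here the calibration data is generated autoregressively by the model itself rather than drawn from a dataset. A sketch of just the Hessian accumulation step, with random vectors standing in for real layer inputs:

```python
import numpy as np

def accumulate_hessian(activations):
    # H = (2/n) * sum_i x_i x_i^T for a linear layer with input dim d;
    # activations is (n, d). The 2/n scaling follows the GPTQ formulation.
    n, d = activations.shape
    return 2.0 * activations.T @ activations / n

rng = np.random.default_rng(0)
# Stand-ins for activations from 64 self-generated sequences (shapes are
# illustrative; the real run used 2048 calibration tokens per sequence).
calib = rng.standard_normal((64 * 32, 128))
H = accumulate_hessian(calib)
print(H.shape)              # (128, 128)
print(np.allclose(H, H.T))  # True: symmetric by construction
```

The "full Hessian" wording suggests the complete d×d matrix is kept (and inverted via Cholesky) rather than a diagonal or blocked approximation.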
Novel Contributions
- XSA applied to all 11 layers
- BigramHash 3072×112 embedding
- Full Hessian GPTQ int6 with autoregressive self-generated calibration
- Late QAT with int6 quantization
- U-Net style skip connections
- Partial RoPE and VE128 on later layers
- Sliding window evaluation with stride 64