PR #429

open

Non-record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 control (val_bpb=1.1231, 8xH100 verified)

by AbhisekBasu1

val_bpb

1.1231

Architecture

Transformer

Optimizer

Muon

Artifact Size

15,683,276 bytes

Training Techniques

Weight Averaging

EMA

parameters: {"decay":0.997}
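
For readers unfamiliar with EMA weight averaging, the update can be sketched as below. This is a minimal illustration, not the PR's training code: only the decay of 0.997 comes from the listing above. The `ema_update` helper and the dictionary-of-floats parameter layout are hypothetical.

```python
def ema_update(ema, params, decay=0.997):
    """Blend the current parameters into the running average, in place.

    `ema` and `params` are plain dicts of floats here (an assumption made
    for illustration; a real model would hold tensors).
    """
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema


# Toy run: one scalar "weight" whose live value jumps from 0.0 to 1.0.
ema = {"w": 0.0}
for _ in range(1000):
    ema_update(ema, {"w": 1.0})

# After n steps the average has closed all but decay**n of the gap.
print(ema["w"], 1.0 - 0.997 ** 1000)
```

With a decay of 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997), about 333 steps, so it smooths over the last few hundred updates.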

Quantization

GPTQ-lite

bits: null

scope: all

QAT

bits: null

scope: all

int6

bits: 6

scope: all
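
The per-row clip-percentile search mentioned under Novel Contributions might look roughly like the following sketch. Only the 6-bit width and the per-row search come from this page; the rest is assumed for illustration. That covers the candidate percentiles, the squared-error criterion, the symmetric [-31, 31] code range, and every function name. It is not full GPTQ, which uses second-order information.

```python
import math


def percentile_abs(row, pct):
    """Nearest-rank percentile of |x| for pct in (0, 100]."""
    mags = sorted(abs(x) for x in row)
    rank = max(1, math.ceil(pct / 100.0 * len(mags)))
    return mags[rank - 1]


def quantize_row(row, clip, qmax=31):
    """Symmetric 6-bit quantization of one row, clipping at +/- clip."""
    scale = clip / qmax if clip > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def best_clip_quantize(row, percentiles=(99.0, 99.5, 99.9, 100.0)):
    """Try each candidate clip and keep the one with the lowest squared error.

    The percentile grid is a hypothetical choice, not taken from the PR.
    Returns (error, percentile, codes, scale).
    """
    best = None
    for pct in percentiles:
        clip = percentile_abs(row, pct)
        codes, scale = quantize_row(row, clip)
        err = sum((x - c * scale) ** 2 for x, c in zip(row, codes))
        if best is None or err < best[0]:
            best = (err, pct, codes, scale)
    return best


# Toy row: 256 values in [-1, 1] plus a single large outlier.
row = [((i * 37) % 101 - 50) / 50.0 for i in range(256)] + [8.0]
err, pct, codes, scale = best_clip_quantize(row)
print(pct, scale, err)
```

Because the 100th percentile (no clipping) is always among the candidates, the search can never do worse than plain max-abs scaling on the chosen error metric.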

LR Schedule

warmdown3500

parameters: {"warmdown_steps":3500}
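
As a rough illustration of how a 3500-step warmdown and a late QAT threshold of 0.15 could interact, the sketch below rests on three assumptions that are not confirmed by this page. The warmdown is taken to be a linear decay of the learning-rate multiplier to zero. QAT is taken to switch on once that multiplier drops below the threshold. The total step count of 7000 is invented for the example. The names `lr_scale` and `qat_enabled` are hypothetical.

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Learning-rate multiplier: flat at 1.0, then linear decay to 0.0.

    Assumes a step-based linear warmdown over the final `warmdown_steps`.
    """
    warmdown_start = max(total_steps - warmdown_steps, 0)
    if step < warmdown_start:
        return 1.0
    return max(total_steps - step, 0) / warmdown_steps


def qat_enabled(scale, threshold=0.15):
    """Assumed reading of "QAT@0.15": fake-quantize once the scale is low."""
    return scale < threshold


TOTAL_STEPS = 7000  # hypothetical; the real step count is not given here
for step in (0, 3500, 5250, 6400, 6500, 7000):
    s = lr_scale(step, TOTAL_STEPS)
    print(step, round(s, 4), qat_enabled(s))
```

Under these assumptions QAT would cover only the final 15% of the warmdown, which is 525 steps for a 3500-step warmdown.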

Architecture

XSA

Uses XSA-last-4 attention/structure variant

parameters: {"last_n":4}

VE128

Vector embedding enhancement enabled

parameters: {"dim":128,"layers":[9,10]}

SmearGate

Added SmearGate architectural component

parameters: null

BigramHash

Added BigramHash feature/component

parameters: {"vocab_size":2048,"dim":128}
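
One plausible reading of a BigramHash component with these parameters is a 2048-row, 128-dimensional embedding table indexed by a hash of each (previous token, current token) pair. The sketch below illustrates that idea only. The hash function, the handling of the first position, the random initialisation, and all names are assumptions rather than the PR's implementation.

```python
import random

BUCKETS = 2048  # "vocab_size" from the parameters above
DIM = 128       # "dim" from the parameters above


def bigram_bucket(prev_id, cur_id, buckets=BUCKETS):
    """Hypothetical hash of a token pair into a bucket index."""
    return ((prev_id * 1000003) ^ cur_id) % buckets


def bigram_ids(token_ids, buckets=BUCKETS):
    """One bucket index per position; position 0 uses bucket 0 (assumption)."""
    if not token_ids:
        return []
    ids = [0]
    for prev, cur in zip(token_ids, token_ids[1:]):
        ids.append(bigram_bucket(prev, cur, buckets))
    return ids


# Toy embedding table with small random values (stand-in for learned weights).
rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]


def bigram_features(token_ids):
    """Look up one DIM-sized vector per position from the hashed table."""
    return [table[i] for i in bigram_ids(token_ids)]


tokens = [17, 503, 17, 503, 9]
print(bigram_ids(tokens))
```

Hashing keeps the table small, at 2048 x 128 = 262,144 entries, at the cost of collisions between unrelated token pairs.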

Regularization

LN Scale

parameters: null

Compression

zstd

level: 22

Evaluation

sliding window eval

parameters: {"stride":64}
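
Sliding-window evaluation with a stride of 64 can be sketched as follows. Each window is fed to the model with as much left context as fits, but only the newest positions are scored, so every token is counted exactly once. The context length of 1024 and the `sliding_windows` helper are hypothetical; only the stride comes from this page.

```python
def sliding_windows(n_tokens, seq_len=1024, stride=64):
    """Return (begin, end, score_from) triples.

    Feed tokens[begin:end] to the model and accumulate loss only on
    positions score_from..end-1. The first window scores everything it
    contains; later windows score only the `stride` newest positions.
    """
    windows = []
    scored_upto = 0
    while scored_upto < n_tokens:
        step = seq_len if scored_upto == 0 else stride
        end = min(scored_upto + step, n_tokens)
        begin = max(end - seq_len, 0)
        windows.append((begin, end, scored_upto))
        scored_upto = end
    return windows


wins = sliding_windows(2000, seq_len=1024, stride=64)
print(len(wins), wins[:2], wins[-1])
```

A smaller stride gives each scored token more context at the cost of more forward passes. In this sketch, every full window after the first gives its scored tokens at least `seq_len - stride` tokens of preceding context.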

Novel Contributions

Validated 8xH100 SXM control run of the EMA + GPTQ-lite + warmdown3500 + QAT@0.15 stack
Improved on the earlier validated #414-class control result
Used per-row clip-percentile search for GPTQ-lite post-training quantization
Extended warmdown to 3500 iterations
Applied late QAT threshold of 0.15
Included XSA-last-4, VE128, LN Scale, SmearGate, and BigramHash modifications
Exported the final artifact with int6 + zstd-22 compression
Evaluated with sliding-window stride 64