PR #429

open

Non-record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 control (val_bpb=1.1231, 8xH100 verified)

by AbhisekBasu1
val_bpb: 1.1231
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,683,276 bytes

Training Techniques

Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
GPTQ-lite
bits: null
scope: all
QAT
bits: null
scope: all
int6
bits: 6
scope: all
LR Schedule
warmdown3500
parameters: {"warmdown_steps":3500}
Architecture
XSA
Uses the XSA-last-4 attention variant
parameters: {"last_n":4}
VE
Vector embedding enhancement enabled
parameters: {"dim":128,"layers":[9,10]}
SmearGate
Added SmearGate architectural component
parameters: null
BigramHash
Added BigramHash component
parameters: {"vocab_size":2048,"dim":128}
Regularization
LN Scale
parameters: null
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
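
The schedule-related settings listed above (EMA decay 0.997, warmdown_steps 3500, and the 0.15 QAT threshold from the title) can be illustrated with a minimal Python sketch. It is not the PR's code. It assumes a step-based linear warmdown to zero, and it assumes the QAT threshold is compared against the learning-rate multiplier, which this summary does not state. `total_steps` and all function names are hypothetical.

```python
from typing import List, Sequence


def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3500) -> float:
    """Multiplier on the base LR: 1.0 until the warmdown starts, then a
    linear ramp to 0.0 over the final `warmdown_steps` steps (assumed shape)."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return remaining / warmdown_steps


def qat_active(scale: float, threshold: float = 0.15) -> bool:
    """Assumed rule: fake-quantization is enabled late in training, once the
    LR multiplier has dropped below `threshold`."""
    return scale < threshold


def ema_update(ema: Sequence[float], params: Sequence[float],
               decay: float = 0.997) -> List[float]:
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]
```

For a hypothetical 7,000-step run, the multiplier stays at 1.0 through step 3,500, and under the assumed rule QAT would switch on once fewer than 525 steps remain (0.15 × 3500).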

Novel Contributions

  • Validated 8xH100 SXM control run of the EMA + GPTQ-lite + warmdown3500 + QAT@0.15 stack
  • Improved on the earlier validated #414-class control result
  • Used per-row clip-percentile search for GPTQ-lite post-training quantization
  • Extended warmdown to 3500 iterations
  • Applied a late-QAT threshold of 0.15
  • Included XSA-last-4, VE128, LN Scale, SmearGate, and BigramHash modifications
  • Exported the final artifact with int6 + zstd-22 compression
  • Evaluated with a sliding window of stride 64
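
The per-row clip-percentile search named in the list above can be sketched in plain Python. This is an illustration under assumptions, not the PR's GPTQ-lite implementation: the candidate percentiles, the symmetric int6 range [-31, 31], the mean-squared-error criterion, and all function names are hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple

QMAX = 31  # assumed symmetric int6 range [-31, 31]; the PR's exact range is not stated


def quantize_row(row: Sequence[float], clip: float) -> Tuple[List[int], float]:
    """Round a row to integer codes, saturating values beyond +/- clip."""
    scale = clip / QMAX if clip > 0 else 1.0
    codes = [max(-QMAX, min(QMAX, round(w / scale))) for w in row]
    return codes, scale


def row_mse(row: Sequence[float], codes: Sequence[int], scale: float) -> float:
    """Mean squared reconstruction error of the dequantized row."""
    return sum((w - c * scale) ** 2 for w, c in zip(row, codes)) / len(row)


def percentile(sorted_vals: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an ascending list."""
    rank = math.ceil(pct / 100.0 * len(sorted_vals)) - 1
    return sorted_vals[min(len(sorted_vals) - 1, max(0, rank))]


def best_clip_quantize(
    row: Sequence[float],
    percentiles: Sequence[float] = (100.0, 99.9, 99.5, 99.0, 98.0),
) -> Tuple[float, List[int], float, float]:
    """Try each candidate clip for this row and keep the lowest-error one.

    Returns (mse, codes, scale, chosen_percentile).
    """
    if not row:
        raise ValueError("row must be non-empty")
    mags = sorted(abs(w) for w in row)
    best = None
    for pct in percentiles:
        clip = percentile(mags, pct)
        codes, scale = quantize_row(row, clip)
        err = row_mse(row, codes, scale)
        if best is None or err < best[0]:
            best = (err, codes, scale, pct)
    return best
```

Because the 100th percentile (plain max-abs scaling) is one of the candidates, the selected clip is never worse in row MSE than unclipped quantization. The integer codes and per-row scales would then be packed and compressed (the PR reports zstd level 22); that step is omitted here.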