PR #1475

open

Non-record: 8xH100->1xH100 Two-Stage GPTQ Baseline — val_bpb 1.13072, 15,651,808 bytes

by Jaksenc
val_bpb: 1.1307
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 15,651,808 bytes

Training Techniques

Quantization: GPTQ (bits: 6, scope: all)
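
The PR does not include the quantization code itself; below is a minimal sketch of the GPTQ error-compensation loop at 6 bits, assuming a PyTorch weight matrix and a calibration Hessian. The per-row symmetric scale and the damping value are illustrative choices, not the submission's settings.

```python
import torch

def gptq_quantize(W, H, bits=6, damp=0.01):
    """Quantize W (out_dim x in_dim) column by column, GPTQ-style:
    each column's rounding error is pushed onto the not-yet-quantized
    columns through the inverse Hessian. H is assumed to be the
    calibration Hessian 2 * X @ X.T accumulated from layer inputs."""
    W = W.clone().float()
    n = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    # One symmetric scale per output row (real GPTQ code often uses
    # finer-grained groups; this is an illustrative simplification).
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    # Dampen the diagonal for numerical stability, then invert.
    H = H + damp * H.diagonal().mean() * torch.eye(n)
    Hinv = torch.inverse(H)
    Q = torch.zeros_like(W)
    for j in range(n):
        w = W[:, j]
        q = (w / scale[:, 0]).round().clamp(-qmax - 1, qmax) * scale[:, 0]
        Q[:, j] = q
        err = (w - q) / Hinv[j, j]
        # Error compensation: spread this column's rounding error
        # over the columns still waiting to be quantized.
        W[:, j + 1:] -= err[:, None] * Hinv[j, j + 1:][None, :]
    return Q
```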
Architecture:
  • BigramHash: bigram hash embedding used in the base stack (dimensions: 3072 x 112); a sketch follows this list
  • XSA: applied to all 11 layers in the base stack (layers: 11)
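
The 3072 x 112 shape suggests a hashed bigram table of 3072 buckets with width 112; the sketch below shows one plausible reading, with the hash constants being assumptions. XSA is not expanded anywhere in the PR, so no sketch is attempted for it.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Embedding keyed on hashed (prev_token, cur_token) bigrams.

    All bigrams share 3072 buckets of width 112, matching the listed
    3072 x 112 shape. The hash constants are illustrative."""

    def __init__(self, n_buckets=3072, dim=112):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, dim)

    def forward(self, tokens):
        # tokens: (batch, seq) integer token ids.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        # Cheap multiplicative hash of the bigram into a bucket id.
        h = (prev.long() * 1000003 + tokens.long() * 8191) % self.n_buckets
        return self.table(h)  # (batch, seq, dim)
```

How the 112-wide output is folded into the model width (projection, concatenation, or a dedicated slice) is not stated in the PR.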
Compression: lzma (level: 9)
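
Artifact packing at lzma level 9 maps directly onto Python's standard library; a minimal sketch, with the file names as placeholders:

```python
import lzma

# Pack the quantized checkpoint with lzma at preset 9, matching the
# listed compression settings.
with open("quantized_checkpoint.bin", "rb") as f:
    raw = f.read()

packed = lzma.compress(raw, preset=9)  # stdlib; 9 is the highest standard preset
with open("artifact.xz", "wb") as f:
    f.write(packed)

print(f"packed {len(raw)} -> {len(packed)} bytes")
```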
Weight Averaging: EMA
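
No EMA parameters are listed, so the decay below is an assumption; the update itself is the standard exponential moving average of weights:

```python
import torch

@torch.no_grad()
def update_ema(ema_params, model_params, decay=0.999):
    # In-place EMA update: ema <- decay * ema + (1 - decay) * param.
    for ema, p in zip(ema_params, model_params):
        ema.mul_(decay).add_(p, alpha=1 - decay)

# Typical usage: keep a detached copy of the weights, update it after
# every optimizer step, and export/evaluate the averaged copy.
# ema_params = [p.detach().clone() for p in model.parameters()]
# update_ema(ema_params, model.parameters())
```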
Evaluation: sliding window eval
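
A sliding-window bpb evaluation typically rereads long left context while scoring only the tokens an earlier window has not covered; the window and stride sizes below are assumptions, and `model` stands in for any causal LM returning logits.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, ids, n_bytes, window=2048, stride=512):
    """Bits-per-byte over a long token stream with overlapping windows.

    `model` is assumed to return (batch, seq, vocab) logits; `n_bytes`
    is the raw byte count of the evaluated text."""
    n = ids.numel()
    total_nll, scored_to = 0.0, 1  # position 0 has no target
    for begin in range(0, n, stride):
        end = min(begin + window, n)
        chunk = ids[begin:end].unsqueeze(0)            # (1, T)
        logits = model(chunk[:, :-1])[0]               # (T-1, vocab)
        nll = F.cross_entropy(logits, chunk[0, 1:], reduction="none")
        new = end - max(scored_to, begin + 1)          # unscored tail
        if new > 0:
            total_nll += nll[-new:].sum().item()
            scored_to = end
        if end == n:
            break
    # Summed nats -> bits, normalized by the raw byte count.
    return total_nll / math.log(2) / n_bytes
```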
Regularization: pruning
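
Pruning is listed with no parameters, so both the method and the ratio below are guesses; a global magnitude-pruning sketch at an assumed 50% sparsity:

```python
import torch

@torch.no_grad()
def magnitude_prune(model, sparsity=0.5):
    """Global magnitude pruning: zero the smallest-magnitude weights
    across all matrix parameters."""
    mats = [p for p in model.parameters() if p.dim() > 1]
    mags = torch.cat([p.abs().flatten() for p in mats])
    k = max(1, int(sparsity * mags.numel()))
    threshold = mags.kthvalue(k).values
    for p in mats:
        p.mul_((p.abs() > threshold).to(p.dtype))
```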

Novel Contributions

  • Validated a two-stage 8xH100 -> 1xH100 execution path (a sketch of the stage-2 flow follows this list)
  • Stage 1: training and checkpoint export on 8xH100
  • Stage 2: GPTQ quantization, artifact packing, and final evaluation on 1xH100
  • Kept the final artifact (15,651,808 bytes) under the 16,000,000-byte cap
  • Demonstrated that GPTQ and final evaluation can be moved off the expensive 8xH100 box
  • Documented a reusable non-record baseline for future compliant reruns
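
For concreteness, a hypothetical stage-2 driver tying the pieces together on the single H100. Every helper and path name here is an illustrative stand-in, not the submission's actual entry point:

```python
import lzma

def run_stage2(ckpt="stage1_checkpoint.pt", out="artifact.xz",
               size_cap=16_000_000):
    """Hypothetical stage-2 flow: load the stage-1 export, apply 6-bit
    GPTQ, lzma-pack the artifact, check the cap, then evaluate.
    load_model, quantize_all_layers, serialize_quantized, and
    evaluate_val_bpb are placeholder names, not real APIs."""
    model = load_model(ckpt)                  # stage-1 checkpoint
    quantize_all_layers(model, bits=6)        # GPTQ, scope: all
    payload = serialize_quantized(model)      # packed ints + scales
    packed = lzma.compress(payload, preset=9)
    assert len(packed) < size_cap, f"{len(packed)} bytes exceeds cap"
    with open(out, "wb") as f:
        f.write(packed)
    print("val_bpb:", evaluate_val_bpb(model))
```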