PR #1694 (open)
Non-record: 11L XSA-All + EMA + Legal GPTQ on 8xH100 (1.11355 BPB)
by Rtx09x
val_bpb: 1.1136
Architecture: Transformer
Optimizer: —
Artifact Size: 15,353,950 bytes
Training Techniques
Architecture: XSA
XSA applied to all 11 layers of the model.
parameters: {"layers":11,"scope":"all"}
Weight Averaging: EMA
parameters: null
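The card lists no EMA parameters, so the decay below is an assumed placeholder. A minimal sketch of how an EMA shadow copy of the weights might be maintained during training:

```python
import copy
import torch

class EMA:
    """Exponential moving average of model parameters.

    The card reports parameters: null, so decay=0.999 is an
    assumed placeholder, not the submission's actual value.
    """

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * live weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)
```

At evaluation (and presumably quantization) time, the shadow weights replace the live ones.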
Quantization: GPTQ
bits: 6, scope: all
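GPTQ itself solves a per-layer error-compensation problem; as an illustration of the bits=6 storage format only, here is a sketch of symmetric per-channel 6-bit quantization. This is plain round-to-nearest, not the submission's GPTQ rounding, and per-output-channel scaling is an assumption:

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-output-channel 6-bit quantization (RTN sketch).

    Shows the int6 storage format only; the submission uses GPTQ's
    error-compensated rounding, not this naive round-to-nearest.
    """
    qmax = 2 ** (6 - 1) - 1                       # 31 for signed int6
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                 # avoid division by zero
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                # int6 values held in int8

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale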
Compression: lzma
level: null
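The packed artifact is lzma-compressed (level: null suggests a default preset). A minimal sketch with Python's standard lzma module; the file names and preset are placeholders:

```python
import lzma

# Compress the quantized checkpoint. preset 9 | PRESET_EXTREME trades
# compression time for size, which matters under a 16MB artifact cap.
# (The card reports level: null, so the actual preset is unknown.)
with open("model_int6.bin", "rb") as f:
    raw = f.read()
blob = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
with open("artifact.xz", "wb") as f:
    f.write(blob)
print(f"{len(raw)} -> {len(blob)} bytes")
```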
Evaluation: sliding window eval
parameters: null
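Under sliding-window evaluation, the model scores long text in overlapping windows so that every token is conditioned on up to a full context while being counted exactly once. A sketch of bits-per-byte computed this way; the window and stride values are assumptions (the card lists parameters: null):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, ids, n_bytes, window=2048, stride=512):
    """Score each target token once, conditioned on up to `window`
    previous tokens; convert total nats to bits per byte.
    window/stride are assumed values, not the submission's.
    """
    nll = 0.0
    for begin in range(0, len(ids) - 1, stride):
        end = min(begin + stride, len(ids) - 1)    # targets ids[begin+1:end+1]
        ctx_start = max(0, end - window)           # left-extend the context
        x = torch.tensor(ids[ctx_start:end]).unsqueeze(0)
        y = torch.tensor(ids[ctx_start + 1 : end + 1]).unsqueeze(0)
        logits = model(x)                          # (1, T, vocab) assumed
        loss = F.cross_entropy(logits[0], y[0], reduction="none")
        nll += loss[-(end - begin):].sum().item()  # keep only new targets
    return nll / (n_bytes * math.log(2))
```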
Sequence Length
train_length: 2048, eval_length: null
Other
GPTQ calibration uses only legal, self-generated data: sequences sampled autoregressively from the model itself.
parameters: {"calibration_sequences":64,"calibration_tokens":2048,"temperature":0.8}
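A sketch of what that self-generation loop might look like, matching the card's parameters (64 sequences of 2048 tokens at temperature 0.8); the bos_id and the raw-logits model interface are assumptions:

```python
import torch

@torch.no_grad()
def self_calibration(model, bos_id=0, n_seq=64, n_tok=2048, temp=0.8):
    """Autoregressively sample GPTQ calibration sequences from the
    model itself (64 x 2048 tokens, T=0.8, per the card).
    bos_id and the (B, T, vocab) logits interface are assumptions.
    """
    seqs = []
    for _ in range(n_seq):
        ids = torch.tensor([[bos_id]])
        for _ in range(n_tok - 1):
            logits = model(ids)[0, -1] / temp      # last-position logits
            probs = torch.softmax(logits, dim=-1)
            nxt = torch.multinomial(probs, 1)
            ids = torch.cat([ids, nxt.view(1, 1)], dim=1)
        seqs.append(ids.squeeze(0))
    return torch.stack(seqs)                       # (64, 2048)
```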
Novel Contributions
- Legal 8x H100 / 600s run under the 16MB cap
- 11L XSA-all stack with EMA
- Legal self-generated autoregressive GPTQ calibration
- Stronger Cholesky damping retries with percentile int6 fallback (see the sketch after this list)
- Explicit pre-quant checkpoint export
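GPTQ factorizes the layer Hessian H via Cholesky; when H is ill-conditioned the factorization fails, so a common remedy is to retry with progressively larger diagonal damping and, if every retry fails, fall back to plain round-to-nearest int6 with percentile clipping. A hedged sketch of that retry loop; the damping schedule and the 99.9th-percentile clip are assumptions, not the submission's values:

```python
import torch

def robust_cholesky(H: torch.Tensor, dampings=(0.01, 0.05, 0.1, 0.5)):
    """Try Cholesky with increasing diagonal damping; None if all fail.
    The damping schedule is an assumed placeholder."""
    diag_mean = H.diagonal().mean()
    eye = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    for damp in dampings:
        try:
            return torch.linalg.cholesky(H + damp * diag_mean * eye)
        except torch.linalg.LinAlgError:
            continue                               # retry with more damping
    return None

def percentile_int6_fallback(w: torch.Tensor, pct=99.9):
    """RTN int6 with percentile clipping, used when Cholesky never
    succeeds. The 99.9th-percentile clip is an assumed value."""
    qmax = 31                                      # signed int6 range
    clip = torch.quantile(w.abs().flatten(), pct / 100.0)
    scale = (clip / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -32, qmax)
    return q.to(torch.int8), scale
```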