PR #1353 (open)

Non-record: 11L XSA-All + EMA + Legal GPTQ on 1xH100 PCIe (1.1546 bpb)

val_bpb: 1.1547
Architecture: Transformer
Optimizer:
Artifact Size: 15,243,770 bytes
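The artifact comes in under the 16MB submission cap after lzma compression (listed under Training Techniques below). A minimal sketch of a size-checked export, assuming the cap is the binary 16 * 1024 * 1024 bytes and using a hypothetical `compress_artifact` helper name:

```python
import lzma

SIZE_CAP = 16 * 1024 * 1024  # assumed binary interpretation of the 16MB cap

def compress_artifact(raw: bytes) -> bytes:
    """Compress a serialized model with lzma and enforce the size cap."""
    blob = lzma.compress(raw, preset=9)
    if len(blob) > SIZE_CAP:
        raise ValueError(f"compressed artifact is {len(blob)} bytes, over the {SIZE_CAP}-byte cap")
    return blob
```

The reported 15,243,770-byte artifact would pass this check with roughly 1.5MB of headroom.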

Training Techniques

  • Architecture: XSA (all-layer XSA variant used in the submitted model; parameters: {"layers":11,"scope":"all"})
  • Weight Averaging: EMA (parameters: null)
  • Quantization: GPTQ (bits: 6, scope: all)
  • Compression: lzma (level: null)
  • Regularization: weight decay (parameters: {"higher_than_default":true})
  • Sequence Length: train_length: 1024, eval_length: null
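The EMA entry refers to an exponential moving average of model weights maintained during training. A minimal sketch of one update step on plain parameter lists; the function name and decay value are illustrative assumptions, not the submission's actual settings:

```python
def ema_update(ema_params, params, decay=0.999):
    """One EMA step, in place: ema <- decay * ema + (1 - decay) * param.

    The decay of 0.999 is an assumed placeholder; the submission does not
    report its EMA hyperparameters (parameters: null).
    """
    for i, p in enumerate(params):
        ema_params[i] = decay * ema_params[i] + (1.0 - decay) * p
    return ema_params
```

At evaluation time the averaged weights, not the raw training weights, would typically be exported.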

Novel Contributions

  • 11-layer XSA-all model variant
  • EMA during training
  • Legal self-generated GPTQ export
  • Hardened GPTQ export with Cholesky retry and damping for non-PD Hessians
  • Fallback to percentile int6 quantization when Hessian factorization fails
  • Explicit pre-quant checkpoint saved before export
  • Non-record unlimited-compute submission under the 16MB cap
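The two export-hardening bullets can be sketched together: retry the Hessian's Cholesky factorization with growing diagonal damping, and fall back to percentile-clipped symmetric int6 quantization if every attempt fails. All names, the damping schedule, and the clipping percentile here are illustrative assumptions, not the submission's actual implementation:

```python
import numpy as np

def damped_cholesky(H, damp=1e-2, max_tries=5):
    """Attempt Cholesky on a Hessian; on failure, retry with growing diagonal damping.

    Returns the lower-triangular factor, or None if H stays non-PD after
    max_tries escalations (damping grows by 10x each attempt; schedule assumed).
    """
    base = damp * float(np.mean(np.diag(H)))
    for attempt in range(max_tries):
        try:
            return np.linalg.cholesky(H + base * (10.0 ** attempt) * np.eye(H.shape[0]))
        except np.linalg.LinAlgError:
            continue
    return None  # caller falls back to percentile quantization

def percentile_int6_quantize(w, pct=99.9):
    """Fallback: symmetric int6 quantization with a percentile-clipped scale.

    Outliers beyond the pct-th percentile of |w| are clipped so they do not
    inflate the scale; [-31, 31] is the assumed symmetric int6 range.
    """
    scale = float(np.percentile(np.abs(w), pct)) / 31.0
    if scale == 0.0:
        return np.zeros(w.shape, dtype=np.int8), 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale
```

Saving an explicit pre-quant checkpoint before running either path means a failed or lossy export never destroys the full-precision weights.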