PR #1694 (open)
Non-record: 11L XSA-All + EMA + Legal GPTQ on 8xH100 (1.11355 BPB)
by Rtx09x
val_bpb: 1.1136
Architecture: Transformer
Optimizer: —
Artifact Size: 15,353,950 bytes
Training Techniques
Architecture: XSA
XSA applied to all 11 layers of the model.
parameters: {"layers":11,"scope":"all"}
Weight Averaging: EMA
parameters: null
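The card lists no EMA parameters, so the decay below is an assumed placeholder. A minimal sketch of how an EMA shadow copy of the weights might be maintained during training:

```python
import copy
import torch

class EMA:
    """Exponential moving average of model parameters.

    The card reports parameters: null, so decay=0.999 is an
    assumed placeholder, not the submission's actual value.
    """

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * live weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)
```

At evaluation (and presumably quantization) time, the shadow weights replace the live ones.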
Quantization: GPTQ
bits: 6, scope: all
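GPTQ itself solves a per-layer error-compensation problem; as an illustration of the bits=6 storage format only, here is a sketch of symmetric per-channel 6-bit quantization. This is plain round-to-nearest, not the submission's GPTQ rounding, and per-output-channel scaling is an assumption:

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-output-channel 6-bit quantization (RTN sketch).

    Shows the int6 storage format only; the submission uses GPTQ's
    error-compensated rounding, not this naive round-to-nearest.
    """
    qmax = 2 ** (6 - 1) - 1                       # 31 for signed int6
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)                 # avoid division by zero
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                # int6 values held in int8

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale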
Compression: lzma
level: null
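The packed artifact is lzma-compressed (level: null suggests a default preset). A minimal sketch with Python's standard lzma module; the file names and preset are placeholders:

```python
import lzma

# Compress the quantized checkpoint. preset 9 | PRESET_EXTREME trades
# compression time for size, which matters under a 16MB artifact cap.
# (The card reports level: null, so the actual preset is unknown.)
with open("model_int6.bin", "rb") as f:
    raw = f.read()
blob = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
with open("artifact.xz", "wb") as f:
    f.write(blob)
print(f"{len(raw)} -> {len(blob)} bytes")
```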
Evaluation: sliding window eval
parameters: null
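Under sliding-window evaluation, the model scores long text in overlapping windows so that every token is conditioned on up to a full context while being counted exactly once. A sketch of bits-per-byte computed this way; the window and stride values are assumptions (the card lists parameters: null):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, ids, n_bytes, window=2048, stride=512):
    """Score each target token once, conditioned on up to `window`
    previous tokens; convert total nats to bits per byte.
    window/stride are assumed values, not the submission's.
    """
    nll = 0.0
    for begin in range(0, len(ids) - 1, stride):
        end = min(begin + stride, len(ids) - 1)    # targets ids[begin+1:end+1]
        ctx_start = max(0, end - window)           # left-extend the context
        x = torch.tensor(ids[ctx_start:end]).unsqueeze(0)
        y = torch.tensor(ids[ctx_start + 1 : end + 1]).unsqueeze(0)
        logits = model(x)                          # (1, T, vocab) assumed
        loss = F.cross_entropy(logits[0], y[0], reduction="none")
        nll += loss[-(end - begin):].sum().item()  # keep only new targets
    return nll / (n_bytes * math.log(2))
```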
Sequence Length
train_length: 2048, eval_length: null
Other
GPTQ calibration uses only legal, self-generated data: sequences sampled autoregressively from the model itself.
parameters: {"calibration_sequences":64,"calibration_tokens":2048,"temperature":0.8}
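A sketch of what that self-generation loop might look like, matching the card's parameters (64 sequences of 2048 tokens at temperature 0.8); the bos_id and the raw-logits model interface are assumptions:

```python
import torch

@torch.no_grad()
def self_calibration(model, bos_id=0, n_seq=64, n_tok=2048, temp=0.8):
    """Autoregressively sample GPTQ calibration sequences from the
    model itself (64 x 2048 tokens, T=0.8, per the card).
    bos_id and the (B, T, vocab) logits interface are assumptions.
    """
    seqs = []
    for _ in range(n_seq):
        ids = torch.tensor([[bos_id]])
        for _ in range(n_tok - 1):
            logits = model(ids)[0, -1] / temp      # last-position logits
            probs = torch.softmax(logits, dim=-1)
            nxt = torch.multinomial(probs, 1)
            ids = torch.cat([ids, nxt.view(1, 1)], dim=1)
        seqs.append(ids.squeeze(0))
    return torch.stack(seqs)                       # (64, 2048)
```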
Novel Contributions
- Legal 8x H100 / 600s run under the 16MB cap
- 11L XSA-all stack with EMA
- Legal self-generated autoregressive GPTQ calibration
- Stronger Cholesky damping retries with percentile int6 fallback (see the sketch after this list)
- Explicit pre-quant checkpoint export
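GPTQ factorizes the layer Hessian H via Cholesky; when H is ill-conditioned the factorization fails, so a common remedy is to retry with progressively larger diagonal damping and, if every retry fails, fall back to plain round-to-nearest int6 with percentile clipping. A hedged sketch of that retry loop; the damping schedule and the 99.9th-percentile clip are assumptions, not the submission's values:

```python
import torch

def robust_cholesky(H: torch.Tensor, dampings=(0.01, 0.05, 0.1, 0.5)):
    """Try Cholesky with increasing diagonal damping; None if all fail.
    The damping schedule is an assumed placeholder."""
    diag_mean = H.diagonal().mean()
    eye = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    for damp in dampings:
        try:
            return torch.linalg.cholesky(H + damp * diag_mean * eye)
        except torch.linalg.LinAlgError:
            continue                               # retry with more damping
    return None

def percentile_int6_fallback(w: torch.Tensor, pct=99.9):
    """RTN int6 with percentile clipping, used when Cholesky never
    succeeds. The 99.9th-percentile clip is an assumed value."""
    qmax = 31                                      # signed int6 range
    clip = torch.quantile(w.abs().flatten(), pct / 100.0)
    scale = (clip / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -32, qmax)
    return q.to(torch.int8), scale
```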