val_bpb: 1.3515
Architecture: GPT
Optimizer: —
Artifact Size: 12,622,882 bytes
Training Techniques

- Quantization: int8 (bits: 8; scope: model weights)
- Architecture: GQA attention. Uses grouped-query attention in the model stack. (parameters: null)
- Other: late-window STE applied on CastedLinear during fake int8 quantization. (parameters: {"module":"CastedLinear","window":"late"})
- Other: torch.compile used with fullgraph disabled. (parameters: {"fullgraph": false})
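The list above only names grouped-query attention; as a point of reference, a minimal NumPy sketch of the idea (head counts, shapes, and the function name are illustrative, not taken from the submission):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Grouped-query attention: n_q_heads query heads share n_kv_heads K/V heads.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    """
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads          # query heads per shared K/V head
    k = np.repeat(k, group, axis=0)          # expand K/V to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `n_kv_heads == n_q_heads` this reduces to standard multi-head attention; fewer K/V heads shrink the KV cache at little quality cost.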
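"Late-window STE during fake int8 quantization" is terse; a hedged NumPy sketch of what that typically means (the window fraction, scale rule, and function names are assumptions, not the submission's actual code):

```python
import numpy as np

def fake_quant_int8(w):
    """Quantize-dequantize weights onto a symmetric int8 grid.

    The forward pass sees quantized values; under a straight-through
    estimator (STE) the backward pass treats this op as identity.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

def maybe_fake_quant(w, step, total_steps, window=0.1):
    """Late-window schedule: fake-quantize only in the final `window`
    fraction of training (the 0.1 fraction is an assumed placeholder)."""
    if step >= int((1.0 - window) * total_steps):
        return fake_quant_int8(w)
    return w
```

In a framework with autograd, the STE is usually written as `w + (fake_quant(w) - w).detach()` so gradients flow through `w` unchanged while the forward uses the quantized values.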
Novel Contributions
- Work-in-progress paired submission bundle (non-record)
- FAKE_QUANT_INT8 with late-window STE on CastedLinear
- torch.compile with fullgraph=False
- GQA attention
- Int8 export clipped at the 99.995th percentile
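The percentile-clipped export can be sketched as follows; this is a minimal illustration of setting the int8 scale from the 99.995th percentile of |w| rather than the absolute max (the function name and symmetric-scale choice are assumptions):

```python
import numpy as np

def export_int8(w, pct=99.995):
    """Symmetric int8 export with the scale taken from the `pct` percentile
    of |w|, so rare outlier weights do not inflate the scale; values beyond
    the clip point saturate at +/-127."""
    clip = np.percentile(np.abs(w), pct)
    scale = clip / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale
```

Dequantization is `q * scale`; the trade-off is a small saturation error on the clipped outliers in exchange for finer resolution on the bulk of the weights.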