val_bpb: 1.2355
Architecture: Transformer
Optimizer: —
Artifact Size: 15.87 MB
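val_bpb is validation bits per byte. The exact normalization behind this number is not stated here, so the sketch below is only an assumption about how such a metric is typically computed: sum the per-token losses, convert nats to bits, and divide by the UTF-8 byte count of the evaluated text.

```python
import math

def bits_per_byte(token_nll_nats, num_bytes):
    """Convert summed per-token negative log-likelihood (in nats)
    into bits per byte of the underlying UTF-8 text."""
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / num_bytes

# Illustrative numbers only: 100 tokens at ~3.0 nats each over 350 bytes of text.
losses = [3.0] * 100
print(round(bits_per_byte(losses, 350), 4))
```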
Training Techniques
Quantization: int8 (bits: 8, scope: all)
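One common way to realize int8 with bits: 8 over all tensors is symmetric per-tensor quantization. The sketch below assumes that scheme; whether the artifact actually uses per-tensor or per-channel scales is not specified here.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values plus one fp32 scale."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize_int8(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print(q.dtype, float(np.abs(w - w_hat).max()))  # int8, small reconstruction error
```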
Architecture: GQA
Grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"heads":8,"kv_heads":4}).
Tied embeddings
Input and output embeddings are tied (parameters: null).
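Tying means the output projection reuses the token-embedding matrix, so only one (vocab, d_model) matrix is stored and trained. A minimal PyTorch sketch, using the 1024 vocab and 512 width from the contributions list below; the module names are assumptions.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: one shared (vocab, d_model) matrix

model = TiedLM()
assert model.lm_head.weight is model.embed.weight
```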
Sequence Length: sequence_length (train_length: 1024, eval_length: null)
Compression: zlib (level: null)
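A minimal sketch of the int8+zlib packaging step, reading level: null as the zlib default. The container format (a NumPy .npz wrapped in a zlib stream) is an assumption for illustration, not the actual artifact layout.

```python
import io
import zlib
import numpy as np

def pack_artifact(tensors: dict) -> bytes:
    """Serialize named (already int8-quantized) arrays and zlib-compress the result."""
    buf = io.BytesIO()
    np.savez(buf, **tensors)              # simple container; format is an assumption
    return zlib.compress(buf.getvalue())  # default compression level

q = np.random.randint(-127, 128, size=(512, 512), dtype=np.int8)
blob = pack_artifact({"layer0.attn.wq": q})
print(f"{len(blob) / 1e6:.2f} MB compressed")
```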
Novel Contributions
- 9-layer 512-dim transformer baseline
- 1024-vocab SentencePiece BPE tokenizer (training sketch after this list)
- Grouped-query attention with 8 heads and 4 KV heads
- Tied embeddings
- Training on 80 FineWeb shards (~8B tokens)
- int8+zlib artifact packaging
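For the 1024-vocab SentencePiece BPE tokenizer listed above, a minimal training sketch; the corpus path, output prefix, and coverage setting are placeholders, not taken from this artifact.

```python
import sentencepiece as spm

# Train a 1024-token BPE model; corpus.txt stands in for the actual training text.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="fineweb_bpe1024",   # assumed output name
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="fineweb_bpe1024.model")
print(sp.encode("hello world", out_type=str))
```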