- val_bpb: 1.1925
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: 15,934,552 bytes
Training Techniques
Quantization: int8
- bits: 8
- scope: all
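A minimal sketch of what symmetric per-row int8 quantization of weight matrices looks like; the exact routine used here is not given in the card, so treat this as an illustration, not the submission's implementation.

```python
import numpy as np

def quantize_int8_per_row(w: np.ndarray):
    """Symmetric per-row int8 quantization: one scale per row,
    chosen so the row's largest-magnitude entry maps to 127."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_int8_per_row(w)
w_hat = dequantize_int8(q, s)
print(np.abs(w - w_hat).max())  # at most half a quantization step per row
```

Rounding error is bounded by half a step (scale / 2) per row, which is why the choice of scale, refined below under "Other", matters for fidelity.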
Architecture: tied embeddings
Input and output embeddings are tied.
- parameters: null
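Tying means one shared matrix serves as both the input embedding table and the output projection. A small numpy sketch of the idea (dimensions are illustrative, not the submission's):

```python
import numpy as np

vocab_size, d_model = 1000, 64
rng = np.random.default_rng(0)

# One shared matrix: its rows are the token embeddings, and its
# transpose is the output projection (no separate unembedding matrix).
W_emb = (rng.normal(size=(vocab_size, d_model)) / np.sqrt(d_model)).astype(np.float32)

def embed(token_ids):
    return W_emb[token_ids]        # input side: row lookup

def logits(hidden):
    return hidden @ W_emb.T        # output side: project onto the same rows

h = embed(np.array([1, 2, 3]))     # stand-in for transformer hidden states
out = logits(h)
print(out.shape)                   # one logit per vocabulary entry
```

Tying removes the separate vocab_size × d_model unembedding matrix, which is a meaningful saving in a ~16 MB artifact.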
Optimizer: Muon
- weight_decay: null
- momentum: 0.99
- other_params: {"momentum_warmup_start": 0.92, "momentum_warmup_steps": 1500}
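The parameters imply momentum ramps from 0.92 to 0.99 over the first 1,500 steps. A sketch of one plausible schedule; the card only records the endpoints and the step count, so the linear shape is an assumption:

```python
def momentum_at(step: int,
                start: float = 0.92,
                final: float = 0.99,
                warmup_steps: int = 1500) -> float:
    """Ramp momentum linearly from `start` to `final` over
    `warmup_steps`, then hold it constant. The linear shape is an
    assumption; only the endpoints and step count are given."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)

print(momentum_at(0), momentum_at(750), momentum_at(1500))
```

Starting with lower momentum reduces the influence of noisy early gradients before settling at the long-run value of 0.99.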
Compression: zlib
- level: null
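A sketch of the compression step using Python's stdlib zlib; "level: null" is read here as the library default, which is an assumption. Quantized int8 weights compress well because the byte values cluster near zero:

```python
import zlib
import numpy as np

# Synthetic stand-in for int8-quantized weights: values concentrated
# near zero, so zlib's entropy coding has plenty to exploit.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.normal(scale=20, size=100_000)),
            -127, 127).astype(np.int8)

raw = q.tobytes()
packed = zlib.compress(raw)      # default level, matching "level: null"
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)

print(len(raw), len(packed))     # compressed payload is smaller
```

The round trip is lossless, so compression only shrinks the artifact; it never changes the deployed weights.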
Sequence Length
- train_length: 4096
- eval_length: null
LR Schedule: warmdown
- parameters: {"warmdown_steps": 3000}
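A sketch of a warmdown schedule: hold the base learning rate, then decay over the final 3,000 steps. The card records only the step count, so the linear-to-zero shape and the surrounding constants are assumptions:

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3000) -> float:
    """Hold base_lr, then decay linearly to 0 over the last
    `warmdown_steps` steps. The shape is an assumption; the card
    only records warmdown_steps=3000."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps

total = 10_000  # hypothetical run length for illustration
print(lr_at(0, total, 0.01), lr_at(8_500, total, 0.01), lr_at(total, total, 0.01))
```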
Other
- Tighter int8 clipping percentile to retain more of the weight distribution tail.
  parameters: {"int8_clip_percentile": 99.99995}
- Higher-precision per-row quantization scales using float32 instead of float16.
  parameters: {"int8_per_row_scale_dtype": "float32"}
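A sketch combining the two tweaks above: each row's scale is set from the 99.99995th percentile of |w| rather than the absolute max, and scales are kept in float32 rather than float16. The exact procedure is inferred from the parameter names, so treat this as an illustration:

```python
import numpy as np

def quantize_int8_clipped(w: np.ndarray,
                          clip_percentile: float = 99.99995,
                          scale_dtype=np.float32):
    """Per-row int8 quantization with percentile clipping.
    Mapping the `clip_percentile` of |w| (instead of the absolute max)
    to 127 keeps a single outlier from inflating the scale; a
    percentile this close to 100 sacrifices almost none of the tail.
    Scales are stored in `scale_dtype` (float32 here, not float16)."""
    clip = np.percentile(np.abs(w), clip_percentile, axis=1, keepdims=True)
    scale = (clip / 127.0).astype(scale_dtype)
    scale = np.where(scale == 0, scale_dtype(1.0), scale)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(2)
w = rng.normal(size=(2, 1000)).astype(np.float32)
q, s = quantize_int8_clipped(w)
print(q.dtype, s.dtype)
```

With a max-based scale, one extreme weight widens the quantization step for its whole row; clipping at a near-100 percentile trades a tiny amount of clipping error on that outlier for finer resolution everywhere else, and float32 scales avoid the rounding a float16 scale would add on top.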
Novel Contributions
- Tighter int8 clipping percentile (99.99995) to preserve more tail weights
- Higher-precision per-row int8 scales using float32
- Muon optimizer tuning with momentum 0.99 and momentum warmup
- Extended warmdown schedule