PR #34

closed

[Partial submission] naive baseline + dispersion loss

by ChenLiu-1996
val_bpb
1.2244
Architecture
Transformer
Optimizer
Artifact Size
15,863,489 bytes

Training Techniques

Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
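Weight tying means a single matrix serves both as the input embedding table and the output projection, roughly halving embedding parameter count. A minimal numpy sketch (matrix shapes and names here are illustrative, not from the submission's code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 16, 8

# One matrix serves both roles: row lookup on the way in,
# inner products against the same rows on the way out.
W = rng.standard_normal((vocab_size, d_model))

tokens = np.array([3, 7, 2])
h = W[tokens]      # input embedding lookup, (T, d_model); a real model runs blocks here
logits = h @ W.T   # output projection reusing the tied matrix, (T, vocab_size)
```

Under tying, the logit for token `t` at position `i` is the inner product of the hidden state with embedding row `t`, so no separate `lm_head` weights are stored.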
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
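With `num_heads=8` and `num_kv_heads=4`, each key/value head is shared by a group of 2 query heads (grouped-query attention), shrinking the KV projections and cache. A sketch of the sharing pattern, assuming the standard repeat-KV formulation (the submission's exact attention code is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_heads, num_kv_heads, head_dim = 5, 8, 4, 4
group = num_heads // num_kv_heads  # 2 query heads per KV head

q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Repeat each KV head for its group of query heads, then run standard attention.
k_full = np.repeat(k, group, axis=0)  # (num_heads, T, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full  # (num_heads, T, head_dim)
```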
Regularization
dispersion loss
parameters: null
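The PR does not spell out its dispersion loss. One common formulation penalizes mean pairwise cosine similarity between embedding rows, pushing representations apart on the unit sphere; the sketch below is that assumed form, not the submission's actual code:

```python
import numpy as np

def dispersion_loss(E):
    """Mean off-diagonal pairwise cosine similarity of the rows of E.
    Minimizing this term spreads the rows apart on the unit sphere.
    (Assumed formulation; the PR does not define its dispersion loss.)"""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = En @ En.T                       # pairwise cosine similarities
    n = E.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]
    return off_diag.mean()
```

This term would be added to the cross-entropy objective with a small weight; orthogonal rows give a loss of 0, identical rows a loss of 1.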
Quantization
int8
bits: 8
scope: all
Compression
zlib
level: null
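The int8 tensors are then byte-serialized and zlib-compressed to produce the submitted artifact. A sketch of the round trip (the `level=9` choice is an assumption; the PR reports `level: null`):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
q = rng.integers(-127, 128, size=100_000, dtype=np.int8)  # stand-in for quantized weights

raw = q.tobytes()
packed = zlib.compress(raw, level=9)  # level is an assumption; the PR leaves it unspecified
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
```

How much zlib actually saves depends on the entropy of the quantized weights; the lossless round trip is what matters for the artifact.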
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Training ran under a 10-minute wallclock cap on 8xH100 GPUs, with validation on the full validation split every 200 steps.
parameters: {"max_wallclock_seconds":600,"num_gpus":8,"val_every_steps":200}
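The wallclock-capped loop with periodic validation can be sketched as below; the function names and callback structure are hypothetical, only the `max_wallclock_seconds=600` and `val_every_steps=200` values come from the submission:

```python
import time

def train(step_fn, validate_fn, max_wallclock_seconds=600, val_every_steps=200):
    """Run training steps until the wallclock budget is spent,
    validating every val_every_steps steps. Sketch only; the real
    loop would also handle checkpointing and distributed setup."""
    start, step = time.monotonic(), 0
    while time.monotonic() - start < max_wallclock_seconds:
        step_fn(step)
        step += 1
        if step % val_every_steps == 0:
            validate_fn(step)
    return step
```

Checking the clock every step (rather than every epoch) is what makes a hard 600-second budget enforceable.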

Novel Contributions

  • Simple baseline with dispersion loss
  • Tied input/output embeddings
  • Reduced KV head count
  • Int8 quantized submission with zlib compression
  • Training under a 10-minute wallclock cap on 8xH100