PR #716

open

Add non-record 4090 warmdown submission

val_bpb

1.4239

Architecture

Transformer

Optimizer

not specified

Artifact Size

14,624,248 bytes

Training Techniques

Architecture

tied embeddings

Input and output embeddings are tied.

parameters: null
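
As a rough illustration of what tying means, the sketch below uses one table for both the input lookup and the output projection. The vocabulary size and model width are hypothetical placeholders (the card does not state them), and plain Python lists stand in for whatever tensor library the submission uses.

```python
import random

VOCAB_SIZE = 16  # hypothetical, not taken from the submission
MODEL_DIM = 4    # hypothetical, not taken from the submission

random.seed(0)
# One shared table: row t is the embedding of token t.
embedding = [[random.gauss(0.0, 0.02) for _ in range(MODEL_DIM)]
             for _ in range(VOCAB_SIZE)]

def embed(token_id):
    """Input side: look up the token's row in the shared table."""
    return embedding[token_id]

def logits(hidden):
    """Output side: score every token against the same table (hidden @ E^T)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding]

def param_count(tied):
    """Embedding-related parameter count with and without tying."""
    table = VOCAB_SIZE * MODEL_DIM
    return table if tied else 2 * table

scores = logits(embed(3))
```

With tying, the embedding-related parameter count is halved, which matters when the artifact size is part of the score.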

KV head count

Uses fewer KV heads than attention heads: 4 KV heads are shared across 8 query heads.

parameters: {"heads":8,"kv_heads":4}
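
A minimal sketch of how 8 query heads could share 4 KV heads. It assumes contiguous grouping (adjacent query heads read the same KV head), which is a common convention but is not confirmed by the card; `kv_head_for` is a hypothetical helper name.

```python
HEADS = 8     # from the card
KV_HEADS = 4  # from the card

def kv_head_for(query_head, heads=HEADS, kv_heads=KV_HEADS):
    """Map a query head to the KV head it reads from (contiguous groups)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return query_head // group_size

mapping = [kv_head_for(q) for q in range(HEADS)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Under this setup the key and value projections need only half as many output columns as the query projection.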

LeakyReLU2

Replaces the default MLP activation with a LeakyReLU-squared variant.

parameters: {"slope":0.5}
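
One plausible reading of "LeakyReLU-squared" is to apply LeakyReLU with the stated slope and then square the result. The scalar sketch below follows that reading. It is an assumption, since the card does not say whether the sign of negative inputs is preserved.

```python
def leaky_relu_squared(x, slope=0.5):
    """Square of LeakyReLU(x); slope 0.5 is the value listed on the card."""
    leaky = x if x >= 0.0 else slope * x
    return leaky * leaky

positive = leaky_relu_squared(2.0)   # 2.0 squared
negative = leaky_relu_squared(-2.0)  # (0.5 * -2.0) squared
```

Under this reading the output is never negative, but negative inputs still contribute a nonzero value and gradient, unlike a plain ReLU-squared.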

LR Schedule

warmdown

parameters: {"warmdown_iters":300}
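
A sketch of a warmdown multiplier, assuming a constant learning rate followed by a linear decay to zero over the last `warmdown_iters` steps. The decay shape and `total_iters` are assumptions. The card gives only the warmdown length.

```python
WARMDOWN_ITERS = 300  # from the card

def lr_scale(step, total_iters, warmdown_iters=WARMDOWN_ITERS):
    """Multiplier applied to the base learning rate at a given step."""
    if warmdown_iters <= 0:
        return 1.0
    warmdown_start = max(total_iters - warmdown_iters, 0)
    if step < warmdown_start:
        return 1.0  # constant phase
    remaining = max(total_iters - step, 0)
    return remaining / warmdown_iters  # linear decay to zero

# Hypothetical run length of 1000 steps, for illustration only.
schedule = [lr_scale(s, 1000) for s in (0, 700, 850, 1000)]
```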

Sequence Length

sequence_length

train_length: 1024

eval_length: null

Quantization

int8

bits: 8

scope: all

Compression

zlib

level: null
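
The two steps above (int8 quantization, then zlib) can be sketched with the standard library alone. This version assumes symmetric quantization with a single scale per tensor and zlib's default compression level. Neither detail is stated on the card, and the helper names are hypothetical.

```python
import array
import zlib

def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> (int8 array, scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    values = (max(-127, min(127, round(w / scale))) for w in weights)
    return array.array("b", values), scale

def pack(weights):
    """Quantize, then compress the raw int8 bytes with zlib."""
    q, scale = quantize_int8(weights)
    return zlib.compress(q.tobytes()), scale

def unpack(blob, scale):
    """Decompress and map the int8 values back to floats."""
    q = array.array("b")
    q.frombytes(zlib.decompress(blob))
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
blob, scale = pack(weights)
restored = unpack(blob, scale)
```

Each restored weight differs from the original by at most about half a quantization step (`scale / 2`), which is the accuracy cost paid for the smaller artifact.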

Other

other

torch.compile is enabled to speed up training and evaluation on a single RTX 4090.

parameters: {"hardware":"1x RTX 4090","wallclock_seconds":300}