PR #93

open

Non-record: Compact 12x384 1xH100 10m

by aamodbhatt
val_bpb: 1.3693
Architecture: Transformer
Optimizer:
Artifact Size: 9,668,102 bytes

Training Techniques

Architecture
  • tied embeddings: uses tied input/output embeddings to reduce artifact size.
    parameters: null
  • depth/width tradeoff: uses a compact Transformer with reduced width and increased depth to improve the compression/quality tradeoff under the size cap.
    parameters: {"layers": 12, "model_dim": 384, "num_heads": 6, "num_kv_heads": 3, "mlp_mult": 2}

Sequence Length
  • sequence_length: train_length 1024, eval_length null

Compression
  • zlib (level: null)
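The per-layer weight count implied by the config above can be checked with a short sketch. This assumes standard attention + MLP blocks with grouped-query attention and no biases; the PR does not state the tokenizer's vocabulary size, so `VOCAB` below is a placeholder, and tied embeddings mean the vocab matrix is counted once.

```python
def transformer_params(layers, d, num_heads, num_kv_heads, mlp_mult, vocab):
    """Rough parameter count; norm parameters and biases are omitted."""
    head_dim = d // num_heads                  # 384 // 6 = 64
    kv_dim = num_kv_heads * head_dim           # 3 * 64 = 192
    attn = d * d + 2 * d * kv_dim + d * d      # Q, K, V, output projections
    mlp = 2 * d * (mlp_mult * d)               # up and down projections
    per_layer = attn + mlp
    # Tied embeddings: one vocab x d matrix serves as both the input
    # embedding and the output head, so it is counted once.
    return layers * per_layer + vocab * d, per_layer

VOCAB = 32_000  # placeholder assumption, not from the PR
total, per_layer = transformer_params(12, 384, 6, 3, 2, VOCAB)
print(per_layer)  # 1,032,192 weights per block
```

At 4 bytes per float the raw block weights alone (~12.4M) would exceed the 16 MB cap, which is presumably why the serialized artifact relies on the zlib pass listed above and/or reduced precision.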

Novel Contributions

  • Compact 12-layer, 384-dimension Transformer configuration under a 10-minute wallclock budget on 1x H100
  • Width reduction with added depth to explore a size/quality tradeoff under the 16MB artifact cap
  • Tied embeddings to reduce serialized model size
  • Non-record negative-result datapoint comparing artifact size against a stronger baseline
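The zlib pass listed under Compression can be sketched with the standard library. The compression level is null in this PR's metadata, so the sketch uses zlib's default, and the flat float payload here is a stand-in for illustration, not the actual checkpoint format.

```python
import struct
import zlib

SIZE_CAP = 16 * 1024 * 1024  # 16 MB artifact cap

def serialize_and_compress(weights, level=-1):
    """Pack a flat list of floats as little-endian fp32, then zlib-compress.

    level=-1 is zlib's default; the PR lists the level as null, so the
    actual run's setting is unknown.
    """
    raw = struct.pack(f"<{len(weights)}f", *weights)
    return zlib.compress(raw, level)

# Stand-in payload: a repetitive weight blob compresses far below the cap.
blob = serialize_and_compress([0.0] * 10_000)
assert len(blob) < SIZE_CAP

# Round-trip check: decompress and unpack back to the original floats.
restored = struct.unpack("<10000f", zlib.decompress(blob))
assert list(restored) == [0.0] * 10_000
```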