PR #93

open

Non-record: Compact 12x384 1xH100 10m

by aamodbhatt
val_bpb: 1.3693
Architecture: Transformer
Optimizer:
Artifact Size: 9,668,102 bytes

Training Techniques

Architecture
  • tied embeddings: uses tied input/output embeddings to reduce artifact size.
    parameters: null
  • depth/width tradeoff: uses a compact Transformer with reduced width and increased depth to improve the compression/quality tradeoff under the size cap.
    parameters: {"layers": 12, "model_dim": 384, "num_heads": 6, "num_kv_heads": 3, "mlp_mult": 2}

Sequence Length
  • sequence_length: train_length 1024, eval_length null

Compression
  • zlib (level: null)
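The per-layer weight count implied by the config above can be checked with a short sketch. This assumes standard attention + MLP blocks with grouped-query attention and no biases; the PR does not state the tokenizer's vocabulary size, so `VOCAB` below is a placeholder, and tied embeddings mean the vocab matrix is counted once.

```python
def transformer_params(layers, d, num_heads, num_kv_heads, mlp_mult, vocab):
    """Rough parameter count; norm parameters and biases are omitted."""
    head_dim = d // num_heads                  # 384 // 6 = 64
    kv_dim = num_kv_heads * head_dim           # 3 * 64 = 192
    attn = d * d + 2 * d * kv_dim + d * d      # Q, K, V, output projections
    mlp = 2 * d * (mlp_mult * d)               # up and down projections
    per_layer = attn + mlp
    # Tied embeddings: one vocab x d matrix serves as both the input
    # embedding and the output head, so it is counted once.
    return layers * per_layer + vocab * d, per_layer

VOCAB = 32_000  # placeholder assumption, not from the PR
total, per_layer = transformer_params(12, 384, 6, 3, 2, VOCAB)
print(per_layer)  # 1,032,192 weights per block
```

At 4 bytes per float the raw block weights alone (~12.4M) would exceed the 16 MB cap, which is presumably why the serialized artifact relies on the zlib pass listed above and/or reduced precision.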

Novel Contributions

  • Compact 12-layer, 384-dimension Transformer configuration under a 10-minute wallclock budget on 1x H100
  • Width reduction with added depth to explore a size/quality tradeoff under the 16MB artifact cap
  • Tied embeddings to reduce serialized model size
  • Non-record negative-result datapoint comparing artifact size against a stronger baseline
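The zlib pass listed under Compression can be sketched with the standard library. The compression level is null in this PR's metadata, so the sketch uses zlib's default, and the flat float payload here is a stand-in for illustration, not the actual checkpoint format.

```python
import struct
import zlib

SIZE_CAP = 16 * 1024 * 1024  # 16 MB artifact cap

def serialize_and_compress(weights, level=-1):
    """Pack a flat list of floats as little-endian fp32, then zlib-compress.

    level=-1 is zlib's default; the PR lists the level as null, so the
    actual run's setting is unknown.
    """
    raw = struct.pack(f"<{len(weights)}f", *weights)
    return zlib.compress(raw, level)

# Stand-in payload: a repetitive weight blob compresses far below the cap.
blob = serialize_and_compress([0.0] * 10_000)
assert len(blob) < SIZE_CAP

# Round-trip check: decompress and unpack back to the original floats.
restored = struct.unpack("<10000f", zlib.decompress(blob))
assert list(restored) == [0.0] * 10_000
```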