PR #71

Status: closed

Add Parameter Golf submission: Depth12 Dim416 KV4

by AntDX316
val_bpb: 1.3509
Architecture: Transformer
Optimizer: (not listed)
Artifact Size: 14301562 bytes (~14.3 MB)

Training Techniques

Architecture
  • tied embeddings: Input and output embeddings are tied to reduce parameters and artifact size. (no parameters listed)
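The parameter saving from tying can be sketched with simple arithmetic. The PR does not state the vocabulary size, so the value below is a hypothetical placeholder for illustration; the model_dim comes from the submission.

```python
# Rough parameter-count arithmetic for embedding tying.
# vocab_size is a hypothetical value -- the PR does not state it.
vocab_size = 32768
model_dim = 416  # from the submission

untied = 2 * vocab_size * model_dim  # separate input and output embedding matrices
tied = vocab_size * model_dim        # one shared matrix serves both roles

saved = untied - tied
print(f"parameters saved by tying: {saved:,}")  # 13,631,488 at this vocab size
```

At int8 precision each saved parameter is roughly one byte before compression, which matters directly under a byte-limited track.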
  • KV head count: Uses fewer key/value heads than attention heads (num_heads: 8, num_kv_heads: 4).
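Sharing each KV head across a group of query heads is the grouped-query attention pattern. A shapes-only sketch of the 8-query / 4-KV layout, assuming head_dim = model_dim // num_heads = 52 (the head split is an assumption; the PR states only the head counts):

```python
import numpy as np

# Grouped-query attention, shapes only: 8 query heads, 4 KV heads,
# so each KV head is shared by a group of 2 query heads.
num_heads, num_kv_heads, head_dim, seq = 8, 4, 52, 16
group = num_heads // num_kv_heads

q = np.random.randn(num_heads, seq, head_dim)
k = np.random.randn(num_kv_heads, seq, head_dim)  # half the K/V projection params
v = np.random.randn(num_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads.
k_rep = np.repeat(k, group, axis=0)
v_rep = np.repeat(v, group, axis=0)

scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_rep

print(out.shape)  # (8, 16, 52)
```

Halving the KV heads halves the K and V projection matrices, another direct parameter saving.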
  • deep/narrow transformer: Uses a deeper but narrower Transformer layout than the naive baseline (layers: 12, model_dim: 416).
Quantization
  • int8 (8 bits, scope: model weights)
Compression
  • zlib (level: not specified)
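The int8 + zlib artifact pipeline can be sketched as below. Per-tensor symmetric quantization is an assumption on my part; the PR states only 8-bit weights followed by zlib.

```python
import zlib
import numpy as np

# Sketch of the artifact pipeline: quantize fp32 weights to int8, then zlib.
# Per-tensor symmetric scaling is an assumed scheme, not confirmed by the PR.
rng = np.random.default_rng(0)
w = rng.standard_normal((416, 416)).astype(np.float32)  # one hypothetical weight matrix

scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte per weight

blob = zlib.compress(q.tobytes(), level=9)
print(f"fp32: {w.nbytes} B, int8: {q.nbytes} B, int8+zlib: {len(blob)} B")

w_hat = q.astype(np.float32) * scale  # dequantize at load time
```

int8 alone gives a 4x size reduction over fp32; zlib then squeezes out whatever redundancy remains in the byte stream, which is how the artifact fits under the 16MB limit.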
Sequence Length
  • train_length: 1024, eval_length: not specified
LR Schedule
  • warmdown (warmup_steps: 20, warmdown_iters: 1200)
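A warmup-then-warmdown schedule with these parameters can be sketched as a learning-rate multiplier. The linear ramp shapes and the total step count are assumptions; the PR gives only warmup_steps and warmdown_iters.

```python
# LR multiplier for a warmup + warmdown schedule.
# Linear ramps and total_steps=2000 are illustrative assumptions.
def lr_mult(step, warmup_steps=20, warmdown_iters=1200, total_steps=2000):
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear warmup to 1.0
    if step >= total_steps - warmdown_iters:
        # linear decay to 0 over the final warmdown_iters steps
        return (total_steps - step) / warmdown_iters
    return 1.0
```

Holding the LR flat until a late linear warmdown is a common choice for short, wallclock-limited runs, since it keeps the LR high for most of the budget.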
Other
  • 10-minute wallclock-limited training run on 8xH100 GPUs (max_wallclock_seconds: 600, num_gpus: 8).
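A wallclock-budgeted run amounts to looping over training steps until the time budget is spent. A minimal sketch, where `train_step` is a stand-in for the real optimizer step (the PR gives only the 600 s budget):

```python
import time

# Run training steps until the wallclock budget is exhausted.
# train_step is a hypothetical callable standing in for one optimizer step.
def run_with_budget(train_step, max_wallclock_seconds=600.0):
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < max_wallclock_seconds:
        train_step()
        steps += 1
    return steps

# Usage with a tiny budget so the sketch finishes instantly:
n = run_with_budget(lambda: None, max_wallclock_seconds=0.01)
```

Checking the budget before each step (rather than after) means the run can slightly undershoot but never meaningfully overshoot the limit.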

Novel Contributions

  • Deeper/narrower Transformer configuration (12 layers, 416 model dim)
  • Reduced KV head count (8 attention heads, 4 KV heads)
  • Tied input/output embeddings
  • 10-minute 8xH100 training run under the 16MB track limit
  • Final artifact compressed with int8 + zlib