depth/width tradeoff

Architecture
Used in: 1 PR
Best BPB: 1.3693
Avg BPB: 1.3693

Hyperparameters Across PRs

pr_number  parameters
93         {"layers":12,"model_dim":384,"num_heads":6,"num_kv_heads":3,"mlp_mult":2}
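
To make the depth/width split in the PR 93 row easier to read, the following minimal sketch derives the shapes this configuration implies and a rough per-block weight count. It rests on assumptions the page does not state: that the head dimension is model_dim / num_heads, that num_kv_heads denotes grouped-query attention with shared key/value heads, that the MLP is two bias-free matrices with hidden size mlp_mult * model_dim, and that embeddings and normalization weights are excluded. The helper name derive_shapes is hypothetical.

```python
import json

# Configuration copied verbatim from the PR 93 row.
config = json.loads(
    '{"layers":12,"model_dim":384,"num_heads":6,"num_kv_heads":3,"mlp_mult":2}'
)


def derive_shapes(cfg):
    """Derive implied shapes and a rough weight count (hypothetical helper).

    Assumes bias-free linear layers, grouped-query attention, and a
    two-matrix MLP; embeddings and norm weights are not counted.
    """
    d = cfg["model_dim"]
    head_dim = d // cfg["num_heads"]          # width of each attention head
    kv_dim = cfg["num_kv_heads"] * head_dim   # total key (or value) width
    mlp_hidden = cfg["mlp_mult"] * d          # assumed MLP hidden size

    # Q and output projections are d x d; K and V projections are d x kv_dim.
    attn_params = 2 * d * d + 2 * d * kv_dim
    # Assumed up-projection plus down-projection.
    mlp_params = 2 * d * mlp_hidden
    per_layer = attn_params + mlp_params

    return {
        "head_dim": head_dim,
        "queries_per_kv_head": cfg["num_heads"] // cfg["num_kv_heads"],
        "mlp_hidden": mlp_hidden,
        "params_per_layer": per_layer,
        "params_all_layers": per_layer * cfg["layers"],
    }


shapes = derive_shapes(config)
for name, value in shapes.items():
    print(f"{name}: {value}")
```

Under these assumptions, changing layers scales the total linearly, while changing model_dim scales each block roughly quadratically, which is the tradeoff the page title refers to.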