
Transformer (Architecture)

Used in: 8 PRs
Best BPB: 1.0717
Avg BPB: 1.2851

Hyperparameters Across PRs

PR number   Parameters
284         {"layers":10}
298         {"dim":768}
724         {"layers":10,"dimensions":512,"gqa":"8/4","bigramhash_buckets":10240}
985         {"dimensions":800,"layers":6,"heads":10}
1116        {"layers":11}
1167        {"layers":10}
1357        {"layers":12,"model_dim":512,"attention_heads":8,"kv_heads":4,"mlp_multiplier":3,"mlp_hidden":1536,"rope_dims":"16/64","vocab_size":1024,"bigram_buckets":1536}
1505        {"layers":11}
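The table's configs are internally consistent in ways the field names only hint at: in PR 1357, `mlp_hidden` equals `model_dim` times `mlp_multiplier`, and the `attention_heads`/`kv_heads` pair (8/4) is the grouped-query-attention ratio. A minimal sketch checking these relationships, assuming the rows are plain JSON as shown (the derivations below are inferred from the values, not stated by the source):

```python
import json

# Two rows copied from the table above; field names vary by PR.
rows = {
    1357: '{"layers":12,"model_dim":512,"attention_heads":8,"kv_heads":4,'
          '"mlp_multiplier":3,"mlp_hidden":1536,"rope_dims":"16/64",'
          '"vocab_size":1024,"bigram_buckets":1536}',
    985: '{"dimensions":800,"layers":6,"heads":10}',
}

cfg = json.loads(rows[1357])

# MLP hidden size is the model dimension scaled by the multiplier:
# 1536 = 512 * 3.
assert cfg["mlp_hidden"] == cfg["model_dim"] * cfg["mlp_multiplier"]

# Grouped-query attention: query heads must be a multiple of KV heads
# (here 8 query heads share 4 KV heads, i.e. 2 queries per KV group).
assert cfg["attention_heads"] % cfg["kv_heads"] == 0
queries_per_kv = cfg["attention_heads"] // cfg["kv_heads"]

# Per-head dimension for PR 985's config (hypothetical derivation,
# assuming "dimensions" is the model width): 800 // 10 heads.
c2 = json.loads(rows[985])
head_dim = c2["dimensions"] // c2["heads"]
print(queries_per_kv, head_dim)  # → 2 80
```

The same checks could be run over every row, but the schemas differ per PR (e.g. `dim` vs `dimensions` vs `model_dim`), so a real validator would need to normalize field names first.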