Gated Attention

Architecture
Used in: 73 PRs
Best BPB: 0.0281
Avg BPB: 1.0038

Submissions

PR #344 by aryanbhosale: 1.1330
PR #413 by anantdgoel: 1.4525
PR #430 by sahiee-dev: 1.1428
PR #474 by joshuaswarren: 1.1690
PR #487 by anantdgoel: 1.1720
PR #516 by Asukabot0: 1.1428
PR #562 by bigbag: 1.1354
PR #635 by aryanbhosale: 1.1330
PR #638 by Asukabot0: 1.1164
PR #670 by abaybektursun: 1.1171
PR #715 by Asukabot0: 1.0337
PR #727 by Asukabot0: 0.9674
PR #733 by stukenov: 1.0278
PR #745 by stukenov: 1.0222
PR #754 by aryanbhosale: 1.1253
PR #758 by hypery11: 1.0465
PR #761 by Asukabot0: 0.9581
PR #763 by hypery11: 0.9917
PR #788 by hypery11: 0.9059
PR #795 by hypery11: 0.8881
PR #813 by hypery11: 0.6671
PR #828 by bigbag: 0.9076
PR #838 by aryanbhosale: 1.1215
PR #850 by callithyia: 0.3212
PR #864 by aryanbhosale: 0.2841
PR #865 by aryanbhosale: 0.2841
PR #871 by greqone: 0.8004
PR #875 by shalyhinpavel: 1.0226
PR #893 by aryanbhosale: 0.1310
PR #909 by sunnypatneedi: 0.8609
PR #921 by TimPietrusky: 0.0939
PR #925 by THUQiXuan: 0.0281
PR #940 by antaloaalonso: 0.9581
PR #950 by jzgdev: 1.3178
PR #952 by FlashyFlash3011: 1.1144
PR #963 by sunnypatneedi: 0.8609
PR #1001 by ibarrajo: 1.1188
PR #1036 by ivanontech: 1.1974
PR #1152 by ericdatum: 1.7942
PR #1159 by JDAppleseed: 0.3693
PR #1170 by Christopher-Lee-McClendon: 1.1199
PR #1185 by skoustav35: 0.9641
PR #1218 by clarkkev (RECORD): 1.0978
PR #1232 by Christopher-Lee-McClendon: 1.0929
PR #1283 by newjordan: 1.1373
PR #1287 by dentity007: 1.1048
PR #1307 by amrayach: 1.1101
PR #1311 by htrung1105: 1.1303
PR #1410 by izlley: 1.1158
PR #1452 by bsisduck: 0.3509
PR #1454 by bsisduck: 0.3509
PR #1490 by wisebreadloaf: 1.6110
PR #1520 by taka6745: 1.0824
PR #1536 by dexhunter: 1.0775
PR #1537 by pireylow: 1.3971
PR #1553 by Abhishek8108: 1.2097
PR #1573 by shivangbaveja: 1.1464
PR #1585 by codemath3000: 1.0639
PR #1627 by mike-ferguson: 1.3246
PR #1633 by joshkmartinez: 1.0585
PR #1667 by MarioPaerle: 1.0714
PR #1670 by dexhunter: 1.0597
PR #1671 by souro26: 1.3827
PR #1671 by souro26: 1.3827
PR #1683 by yunoshev: 1.1280
PR #1689 by chris-colinsky: 1.0822
PR #1697 by Buld1n: 1.0812
PR #1728 by mikeapedia: 1.0771
PR #1734 by yahya010: 1.0108
PR #1736 by dexhunter: 1.0655
PR #1738 by alertcat: 1.0354
PR #1751 by Pravin-dev06: 1.3565
PR #1756 by romeerp: 1.0651
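
The summary figures at the top of the page appear to be derived directly from the per-PR scores listed above. The sketch below is only an illustration, not the site's own code: it assumes "Best BPB" is the minimum score and "Avg BPB" is the plain mean over all 73 listed entries (including the PR #1671 row that appears twice). The variable names are hypothetical.

```python
# Per-PR BPB scores transcribed from the Submissions list, in page order.
# PR #1671 is listed twice on the page and is kept twice here (73 entries).
scores = [
    1.1330, 1.4525, 1.1428, 1.1690, 1.1720, 1.1428, 1.1354, 1.1330, 1.1164, 1.1171,
    1.0337, 0.9674, 1.0278, 1.0222, 1.1253, 1.0465, 0.9581, 0.9917, 0.9059, 0.8881,
    0.6671, 0.9076, 1.1215, 0.3212, 0.2841, 0.2841, 0.8004, 1.0226, 0.1310, 0.8609,
    0.0939, 0.0281, 0.9581, 1.3178, 1.1144, 0.8609, 1.1188, 1.1974, 1.7942, 0.3693,
    1.1199, 0.9641, 1.0978, 1.0929, 1.1373, 1.1048, 1.1101, 1.1303, 1.1158, 0.3509,
    0.3509, 1.6110, 1.0824, 1.0775, 1.3971, 1.2097, 1.1464, 1.0639, 1.3246, 1.0585,
    1.0714, 1.0597, 1.3827, 1.3827, 1.1280, 1.0822, 1.0812, 1.0771, 1.0108, 1.0655,
    1.0354, 1.3565, 1.0651,
]

# Assumption: a lower BPB is better, so the "best" score is the minimum.
best = min(scores)
# Assumption: the average is an unweighted mean over every listed entry.
avg = sum(scores) / len(scores)

print(f"Used in: {len(scores)} PRs")  # Used in: 73 PRs
print(f"Best BPB: {best:.4f}")        # Best BPB: 0.0281
print(f"Avg BPB: {avg:.4f}")          # Avg BPB: 1.0038
```

Under these assumptions the three printed values match the summary shown at the top of the page. The average reaches 1.0038 only when the repeated PR #1671 entry is counted twice.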

Hyperparameters Across PRs

pr_number  parameters
344
413        {"bias_init":4}
430        {"layers":10}
474
487        {"added_params":37000}
516
562
635
638
670
715
727
733
745
754
758
761
763
788
795
813
828
838
850        {"bias":4}
864
865
871
875        {"layers":8,"final_attention_layer":1,"n_embd":384}
893
909
921        {"layers":11,"dim":512,"heads":8,"kv_heads":4}
925        {"experts":16,"hidden_size":512}
940
950
952        {"weight_init":0,"bias_init":4}
963
1001
1036       {"layers":12}
1152       {"init":0.1}
1159       {"enabled":0}
1170
1185
1218
1232       {"qk_gain_init":1.5}
1283       {"qk_gain_init":4}
1287
1307       {"layers":[2,4,6,8,10],"window_size":512}
1311       {"enabled":false}
1410       {"layers":[0,2,4,6,8,10]}
1452       {"layers":9}
1454       {"layers":9}
1490       {"layers":[1,3],"kv_heads":2}
1520
1536
1537       {"layer_start":7}
1553       {"qk_gain":5}
1573
1585       {"start_layer":8}
1627
1633       {"qk_gain_init":5.25}
1667       {"width":12,"layers":11}
1670
1671
1671
1683
1689       {"qk_gain":5.25}
1697       {"looped_band_layers":"3..5","recur_attn_gate":1,"recur_attn_gate_scale":0.5}
1728
1734       {"layers":10,"dimensions":544,"heads":8,"kv_share_stride":2}
1736       {"init_std":0.005}
1738       {"qk_gain":5.25}
1751
1756