← Back to Architecture

attention sink

Architecture
Used in
2 PRs
Best BPB
1.0738
Avg BPB
1.0762

Hyperparameters Across PRs

pr_numberparameters
1895{"sink_keys_per_kv_head":1}
1980