residual mixing

Architecture
Used in: 7 PRs
Best BPB: 1.0577
Avg BPB: 1.2732

Hyperparameters Across PRs

pr_number   parameters
103         -
443         -
460         -
790         {"layers":11,"dimensions":512,"mlp_multiplier":3.5,"mha":"8/8","bigramhash":8192,"xsa":"all layers"}
1180        -
1527        -
2159        -
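
The parameters recorded for PR 790 can be unpacked into a few derived sizes. The sketch below is illustrative only and rests on assumptions that the page does not confirm: that "mha" means query heads / key-value heads, that the heads split "dimensions" evenly, and that the MLP hidden width is mlp_multiplier times dimensions. The variable names are hypothetical.

```python
import json

# Parameters string copied verbatim from the PR 790 row.
raw = (
    '{"layers":11,"dimensions":512,"mlp_multiplier":3.5,'
    '"mha":"8/8","bigramhash":8192,"xsa":"all layers"}'
)
cfg = json.loads(raw)

# Assumption: "mha" is written as "<query heads>/<key-value heads>".
n_heads, n_kv_heads = (int(x) for x in cfg["mha"].split("/"))

# Assumption: the model width is divided evenly among the query heads.
head_dim = cfg["dimensions"] // n_heads

# Assumption: the MLP hidden width is mlp_multiplier * dimensions.
mlp_hidden = int(cfg["mlp_multiplier"] * cfg["dimensions"])

print(n_heads, n_kv_heads, head_dim, mlp_hidden)  # prints: 8 8 64 1792
```

Under these assumptions, the configuration has 8 query heads and 8 key-value heads of width 64, and an MLP hidden width of 1792 across 11 layers.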