
depth recurrence, weight tying, tied embeddings, RoPE, ReLU² MLP 3×, GQA

Architecture
Used in: 1 PR
Best BPB: 1.1750
Avg BPB: 1.1750

Hyperparameters Across PRs

pr_number    parameters
575          {"layers":6,"loop_blocks":2,"loop_iters":3,"embed_dim":512,"num_heads":8,"num_kv_heads":8,"mlp_expansion":3}
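As a rough illustration of what the PR 575 parameters imply, the sketch below parses the JSON from the table and derives a few per-layer sizes. It assumes that the MLP hidden width is embed_dim × mlp_expansion (consistent with the "ReLU² MLP 3×" label) and that heads split embed_dim evenly; the names head_dim, mlp_hidden, q_heads_per_kv_head, and relu_squared are illustrative and not taken from the PR. How layers, loop_blocks, and loop_iters combine into an effective depth is not stated here, so the sketch does not attempt to compute it.

```python
import json

# Parameters copied verbatim from the PR 575 row.
raw = (
    '{"layers":6,"loop_blocks":2,"loop_iters":3,"embed_dim":512,'
    '"num_heads":8,"num_kv_heads":8,"mlp_expansion":3}'
)
cfg = json.loads(raw)

# Assumption: each attention head gets an equal slice of the embedding.
head_dim = cfg["embed_dim"] // cfg["num_heads"]

# Assumption: MLP hidden width = embed_dim * mlp_expansion.
mlp_hidden = cfg["embed_dim"] * cfg["mlp_expansion"]

# GQA group size: number of query heads sharing one key/value head.
q_heads_per_kv_head = cfg["num_heads"] // cfg["num_kv_heads"]


def relu_squared(x: float) -> float:
    """Scalar ReLU² activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2


print(head_dim)             # 64
print(mlp_hidden)           # 1536
print(q_heads_per_kv_head)  # 1
```

With num_kv_heads equal to num_heads, the group size is 1, so under these assumptions the GQA setting for this PR behaves like standard multi-head attention.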