PR #651 (open)
[WIP] Record: Hybrid architecture 8L 3:1 GDN/Transformer (val_bpb=1.2093)
by phulin
val_bpb: 1.2093
Architecture: Hybrid GDN/Transformer
Optimizer: —
Artifact Size: —
Training Techniques

Architecture: Hybrid GDN/Transformer
A hybrid architecture combining GDN and Transformer layers in an 8-layer model at a 3:1 ratio.
parameters: {"layers": 8, "ratio": "3:1"}
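The PR does not state how the GDN and attention layers are interleaved, only the 8-layer depth and 3:1 ratio. A common reading is three GDN layers followed by one full-attention layer, repeated twice (6 GDN + 2 attention). The sketch below is hypothetical — `build_layer_schedule` and the "three-then-one" pattern are assumptions, not the author's confirmed layout:

```python
# Hypothetical sketch of the 8-layer, 3:1 GDN/attention layout.
# The interleaving pattern is NOT specified in the PR; this assumes
# blocks of three GDN layers each followed by one attention layer.

def build_layer_schedule(n_layers: int = 8, gdn_per_attn: int = 3) -> list:
    """Return a per-layer type list: `gdn_per_attn` GDN layers,
    then one attention layer, repeated to fill n_layers."""
    period = gdn_per_attn + 1  # one 4-layer block: gdn, gdn, gdn, attn
    return [
        "attn" if i % period == period - 1 else "gdn"
        for i in range(n_layers)
    ]

print(build_layer_schedule())
# → ['gdn', 'gdn', 'gdn', 'attn', 'gdn', 'gdn', 'gdn', 'attn']
```

Other placements (e.g. attention-first, or attention concentrated mid-stack) would satisfy the same 3:1 ratio; only the counts (6 GDN, 2 attention) follow directly from the stated parameters.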
Quantization: planned but not implemented
bits: null
scope: null
Test-Time Training: TTT
parameters: null
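The PR gives no parameters for its TTT variant, so only the general mechanism can be illustrated: the model keeps updating its own weights on the test stream before each prediction, rather than staying frozen after training. The toy below is a deliberately minimal stand-in — `ttt_predict`, the 1-D linear model, and the squared-error objective are all assumptions for illustration, not the method used in the PR:

```python
# Toy illustration of test-time training (TTT): a model that takes an
# SGD step on each observed test example before predicting the next one.
# A 1-D linear predictor (hypothetical, chosen only to keep the sketch
# dependency-free) stands in for the language model.

def ttt_predict(seq, lr=0.05):
    """Predict each next element of `seq`, updating the model's
    parameters online (at test time) after every observation."""
    w, b = 0.0, 0.0          # model starts untrained
    preds = []
    for t in range(len(seq) - 1):
        x, y = seq[t], seq[t + 1]
        pred = w * x + b
        preds.append(pred)
        err = pred - y       # gradient of 0.5 * (pred - y)^2 w.r.t. pred
        w -= lr * err * x    # test-time SGD step on the just-seen example
        b -= lr * err
    return preds, (w, b)

# On a constant stream the predictor adapts toward the stream's value.
preds, _ = ttt_predict([2.0] * 100)
```

In the LM setting the analogous update is a gradient step on the next-token loss over the context already seen, but the PR does not specify which layers are adapted or how many steps are taken.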
Other: tricks imported from top leaderboard solutions
parameters: null
Novel Contributions
- Hybrid architecture combining GDN and Transformer layers with an 8-layer 3:1 ratio
- Incorporation of test-time training (TTT) techniques
- Planned quantization and hyperparameter optimizations
- Tricks imported from top leaderboard solutions