PR #651

Status: open

[WIP] Record: Hybrid architecture 8L 3:1 GDN/Transformer (val_bpb=1.2093)

val_bpb: 1.2093
Architecture: Hybrid GDN/Transformer
Optimizer: (not specified)
Artifact Size: (not specified)

Training Techniques

  • Architecture: Hybrid GDN/Transformer. Combines GDN and Transformer layers in an 8-layer model with a 3:1 ratio.
    parameters: {"layers": 8, "ratio": "3:1"}
  • Quantization: planned but not implemented.
    bits: null, scope: null
  • Test-Time Training (TTT).
    parameters: null
  • Other: incorporation of tricks from top leaderboard solutions.
    parameters: null
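The PR does not show how the 3:1 ratio is laid out across the 8 layers. A minimal sketch of one plausible schedule, assuming the ratio means repeating blocks of three GDN (Gated DeltaNet) layers followed by one full-attention layer; `hybrid_layer_schedule` is a hypothetical helper, not taken from the PR's code:

```python
def hybrid_layer_schedule(num_layers=8, gdn_per_block=3, attn_per_block=1):
    """Return the layer-type sequence for a hybrid GDN/Transformer stack.

    Hypothetical: assumes the 3:1 ratio is realized by repeating a block
    of `gdn_per_block` GDN layers followed by `attn_per_block` attention
    layers until `num_layers` layers are assigned.
    """
    block = ["gdn"] * gdn_per_block + ["attention"] * attn_per_block
    return [block[i % len(block)] for i in range(num_layers)]
```

With the defaults this yields three GDN layers, one attention layer, repeated twice (6 GDN and 2 attention layers in total), which matches the 8-layer 3:1 description.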

Novel Contributions

  • Hybrid architecture combining GDN and Transformer layers with an 8-layer 3:1 ratio
  • Incorporation of test-time training (TTT) techniques
  • Planned quantization and hyperparameter optimizations
  • Tricks imported from top leaderboard solutions
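Quantization is listed as planned, with bit width and scope still null. As a point of reference for that direction, here is a generic sketch of symmetric per-tensor int8 weight quantization; the scheme, bit width, and function names are illustrative assumptions, not the PR's actual plan:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (generic sketch).

    Illustrative only: the PR leaves bits/scope unspecified, so this
    shows one common scheme, not the PR's chosen one.
    """
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid div-by-zero on all-zero tensors
    scale = max_abs / 127.0  # map the largest magnitude to the int8 extreme
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [qi * scale for qi in q]
```

A per-channel variant (one scale per output channel) usually loses less accuracy than this per-tensor form, at the cost of storing more scales.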