PR #651

Status: open

[WIP] Record: Hybrid architecture 8L 3:1 GDN/Transformer (val_bpb=1.2093)

val_bpb: 1.2093
Architecture: Hybrid GDN/Transformer
Optimizer: (not specified)
Artifact Size: (not specified)

Training Techniques

  • Architecture: Hybrid GDN/Transformer. Combines GDN and Transformer layers in an 8-layer model with a 3:1 ratio.
    parameters: {"layers": 8, "ratio": "3:1"}
  • Quantization: planned but not implemented.
    bits: null, scope: null
  • Test-Time Training (TTT).
    parameters: null
  • Other: incorporation of tricks from top leaderboard solutions.
    parameters: null
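The PR does not show how the 3:1 ratio is laid out across the 8 layers. A minimal sketch of one plausible schedule, assuming the ratio means repeating blocks of three GDN (Gated DeltaNet) layers followed by one full-attention layer; `hybrid_layer_schedule` is a hypothetical helper, not taken from the PR's code:

```python
def hybrid_layer_schedule(num_layers=8, gdn_per_block=3, attn_per_block=1):
    """Return the layer-type sequence for a hybrid GDN/Transformer stack.

    Hypothetical: assumes the 3:1 ratio is realized by repeating a block
    of `gdn_per_block` GDN layers followed by `attn_per_block` attention
    layers until `num_layers` layers are assigned.
    """
    block = ["gdn"] * gdn_per_block + ["attention"] * attn_per_block
    return [block[i % len(block)] for i in range(num_layers)]
```

With the defaults this yields three GDN layers, one attention layer, repeated twice (6 GDN and 2 attention layers in total), which matches the 8-layer 3:1 description.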

Novel Contributions

  • Hybrid architecture combining GDN and Transformer layers with an 8-layer 3:1 ratio
  • Incorporation of test-time training (TTT) techniques
  • Planned quantization and hyperparameter optimizations
  • Tricks imported from top leaderboard solutions
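Quantization is listed as planned, with bit width and scope still null. As a point of reference for that direction, here is a generic sketch of symmetric per-tensor int8 weight quantization; the scheme, bit width, and function names are illustrative assumptions, not the PR's actual plan:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (generic sketch).

    Illustrative only: the PR leaves bits/scope unspecified, so this
    shows one common scheme, not the PR's chosen one.
    """
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid div-by-zero on all-zero tensors
    scale = max_abs / 127.0  # map the largest magnitude to the int8 extreme
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [qi * scale for qi in q]
```

A per-channel variant (one scale per output channel) usually loses less accuracy than this per-tensor form, at the cost of storing more scales.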