PR #1323

open

REHA-DEQ-WSE: Deep Equilibrium with Weight Synthesis for Parameter-Efficient Language Modeling

val_bpb

1.1247

Architecture

Transformer

Optimizer

Muon

Artifact Size

6.8 MB

Training Techniques

Architecture

depth recurrence

Replaces a stack of 11 separate layers with a single weight-tied layer that is iterated toward a fixed point (Deep Equilibrium).

parameters: {"iterations":22}
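As a rough illustration of the depth-recurrence entry above, the sketch below applies one weight-tied layer repeatedly until the state stops changing or the iteration budget runs out. Only the budget of 22 comes from the parameters listed here. The toy tanh layer, the zero initial state, the early-stopping tolerance `tol`, and the names `layer` and `deq_forward` are assumptions made for illustration, not the PR's actual layer or solver.

```python
import math


def layer(z, x, w):
    """One weight-tied layer (toy stand-in): z_next = tanh(W z + x)."""
    return [
        math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + x_i)
        for row, x_i in zip(w, x)
    ]


def deq_forward(x, w, iterations=22, tol=1e-6):
    """Iterate the same layer up to `iterations` times.

    Returns the final state, the number of steps taken, and the last
    residual (largest absolute change between consecutive states).
    The tolerance-based early exit is an assumption of this sketch.
    """
    z = [0.0] * len(x)  # assumed initial state
    residual = float("inf")
    steps = 0
    for steps in range(1, iterations + 1):
        z_next = layer(z, x, w)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            break
    return z, steps, residual


# Toy weights with small row sums, so the map is a contraction
# and plain forward iteration converges.
W = [[0.1, -0.2], [0.05, 0.1]]
X = [0.5, -0.3]
z_star, n_steps, res = deq_forward(X, W)
```

Because the same weights are reused at every step, the parameter count stays that of one layer regardless of how many iterations run. Plain forward iteration only converges when the layer is contractive; this sketch does not cover the case where it is not.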

other

Weight-Synthesis Engine: a small hypernetwork that adjusts the layer's behavior according to input entropy, so the shared layer can specialize for code, prose, or tables.

parameters: {"extra_params":152000,"bottleneck_dim":64}
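One way to picture the Weight-Synthesis Engine entry above: measure the entropy of the input tokens, pass it through a narrow bottleneck, and emit per-channel scales that modulate the shared layer. The bottleneck width of 64 comes from the listed parameters. Everything else here is a hypothetical stand-in: the class name `WeightSynth`, the use of Shannon entropy over token counts, the scalar conditioning, the multiplicative scales, and the random weights. The sketch does not attempt to reproduce the 152,000-parameter budget.

```python
import math
import random
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


class WeightSynth:
    """Hypothetical hypernetwork: entropy -> bottleneck -> channel scales."""

    def __init__(self, width, bottleneck_dim=64, seed=0):
        rng = random.Random(seed)
        # Projection from the scalar entropy into the bottleneck.
        self.w_in = [rng.gauss(0.0, 1.0) for _ in range(bottleneck_dim)]
        self.b_in = [rng.gauss(0.0, 1.0) for _ in range(bottleneck_dim)]
        # Projection from the bottleneck to one scale per channel.
        self.w_out = [
            [rng.gauss(0.0, 0.01) for _ in range(bottleneck_dim)]
            for _ in range(width)
        ]

    def scales(self, entropy):
        """Return per-channel multipliers centered on 1.0."""
        hidden = [math.tanh(w * entropy + b) for w, b in zip(self.w_in, self.b_in)]
        return [
            1.0 + sum(w * h for w, h in zip(row, hidden))
            for row in self.w_out
        ]


synth = WeightSynth(width=8)
repetitive = [7, 7, 7, 7, 7, 7, 7, 7]   # low-entropy input
varied = [0, 1, 2, 3, 4, 5, 6, 7]       # high-entropy input
low_scales = synth.scales(token_entropy(repetitive))
high_scales = synth.scales(token_entropy(varied))
```

The two inputs produce different scale vectors, which is the sense in which a single shared layer could behave differently on repetitive versus varied input.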

XSA

Exclusive Self Attention (XSA), as used in the baseline model.

parameters: null

Optimizer

Muon

weight_decay: null

momentum: null

other_params: null