PR #1323
openREHA-DEQ-WSE: Deep Equilibrium with Weight Synthesis for Parameter-Efficient Language Modeling
by sohv
val_bpb
1.1247
Architecture
Transformer
Optimizer
Muon
Artifact Size
6.8 MB
Training Techniques
Architecture
depth recurrence
Uses a single layer iterated to a fixed point (Deep Equilibrium) instead of stacking 11 separate layers.
parameters: {"iterations":22}
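A minimal sketch of the depth-recurrence idea: one weight-tied block is applied repeatedly (22 iterations, per the reported config) so the repeated application approaches a fixed point, rather than stacking 11 distinct layers. The block internals and names below are illustrative assumptions, not taken from the PR.

```python
import torch
import torch.nn as nn

class DepthRecurrentBlock(nn.Module):
    """Weight-tied block iterated toward a fixed point z* = x + f(z*)."""

    def __init__(self, dim: int, iterations: int = 22):
        super().__init__()
        self.iterations = iterations
        self.norm = nn.LayerNorm(dim)
        # A standard pre-norm MLP stands in for the real layer (assumption).
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(x)
        for _ in range(self.iterations):
            # Re-inject the input x each step, as in deep equilibrium models,
            # so the iteration converges to an input-conditioned fixed point.
            z = x + self.ff(self.norm(z))
        return z
```

Because the same parameters are reused at every iteration, depth comes nearly for free in the artifact size, which is how a single-layer model can substitute for an 11-layer stack.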
other
Weight-Synthesis Engine: a tiny hypernetwork that adapts layer behavior based on input entropy to specialize for code, prose, or tables.
parameters: {"extra_params":152000,"bottleneck_dim":64}
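A hedged sketch of what such a hypernetwork could look like: a scalar entropy statistic of the input is pushed through a small bottleneck (64 dims, per the reported config) to produce per-channel scale/shift modulations of the shared layer. The entropy proxy and the modulation form are assumptions for illustration; this toy network is also far smaller than the reported 152k extra parameters.

```python
import torch
import torch.nn as nn

class WeightSynthesisEngine(nn.Module):
    """Tiny hypernetwork: entropy statistic -> bottleneck -> scale/shift."""

    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, bottleneck_dim),
            nn.GELU(),
            nn.Linear(bottleneck_dim, 2 * dim),
        )

    def forward(self, x: torch.Tensor):
        # Feature-softmax entropy averaged over the sequence, as a cheap
        # proxy for how code-like vs prose-like the input is (assumption).
        p = x.softmax(dim=-1)
        ent = -(p * p.clamp_min(1e-9).log()).sum(-1, keepdim=True)
        ent = ent.mean(dim=1, keepdim=True)  # shape (B, 1, 1)
        scale, shift = self.net(ent).chunk(2, dim=-1)
        # Modulate hidden states as h * (1 + scale) + shift.
        return 1 + scale, shift
```

Usage would wrap the shared layer's output: `h = h * scale + shift`, so the same weights behave differently on code, prose, or tables at negligible parameter cost.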
XSA
Exclusive Self Attention (XSA), carried over unchanged from the baseline model.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Novel Contributions
- Deep Equilibrium model using one layer run repeatedly until convergence
- Weight-Synthesis Engine hypernetwork for input-adaptive parameter modulation
- Fits the model within the 16 MB artifact constraint (6.8 MB) while improving validation BPB
- Reported reproducible improvement across three random seeds