PR #1323

open

REHA-DEQ-WSE: Deep Equilibrium with Weight Synthesis for Parameter-Efficient Language Modeling

val_bpb: 1.1247
Architecture: Transformer
Optimizer: Muon
Artifact Size: 6.8 MB

Training Techniques

Architecture
depth recurrence
Uses a single layer iterated to a fixed point (Deep Equilibrium) instead of stacking 11 separate layers.
parameters: {"iterations":22}
other
Weight-Synthesis Engine: a tiny hypernetwork that adapts layer behavior based on input entropy to specialize for code, prose, or tables.
parameters: {"extra_params":152000,"bottleneck_dim":64}
XSA
Exclusive Self Attention used in the baseline model.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
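
The depth-recurrence entry above can be illustrated with a small sketch. This is not the PR's code: the toy `layer`, the weight shapes, the tolerance `tol`, and the contraction scaling of `W` are all assumptions chosen so that the iteration visibly settles; only the idea of one shared layer and the iteration budget of 22 come from the summary.

```python
import numpy as np

def layer(z, x, W, U, b):
    """One shared layer (hypothetical): mixes the state z with the input x."""
    return np.tanh(z @ W + x @ U + b)

def deq_forward(x, W, U, b, iterations=22, tol=1e-5):
    """Apply the same layer repeatedly; stop early once the state stops moving."""
    z = np.zeros_like(x @ U)
    residual = np.inf
    step = 0
    for step in range(1, iterations + 1):
        z_next = layer(z, x, W, U, b)
        # Relative change between successive states (assumed stopping rule).
        residual = np.linalg.norm(z_next - z) / (np.linalg.norm(z) + 1e-8)
        z = z_next
        if residual < tol:
            break
    return z, step, residual

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))
# Assumption: rescale W to spectral norm 0.5 so the toy layer is a contraction.
W *= 0.5 / np.linalg.norm(W, 2)
U = rng.standard_normal((d, d)) / np.sqrt(d)
b = np.zeros(d)
x = rng.standard_normal((4, d))

z_star, steps, residual = deq_forward(x, W, U, b)
print(f"stopped after {steps} iterations, relative residual {residual:.2e}")
```

The parameter saving comes from reuse: the loop touches only `W`, `U`, and `b`, however many times it runs, whereas a stack of 11 distinct layers would store 11 such sets.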

Novel Contributions

  • Deep Equilibrium model using one layer run repeatedly toward a fixed point (22 iterations)
  • Weight-Synthesis Engine hypernetwork for input-adaptive parameter modulation
  • Fits the model in a 6.8 MB artifact, within the 16MB constraint, while improving BPB
  • Reports a reproducible improvement across three random seeds
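
To make the Weight-Synthesis Engine idea more concrete, here is a minimal sketch of an entropy-conditioned hypernetwork. The summary does not say how entropy is measured or which weights are modulated, so everything here is an assumption: `token_entropy`, the class `WeightSynthesisSketch`, the per-channel scale-and-shift modulation, and `d_model=32` are hypothetical. Only the bottleneck width of 64 is taken from the summary, and the parameter count of this toy is not meant to match the reported 152000.

```python
import numpy as np

def token_entropy(token_ids, vocab_size):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = np.bincount(token_ids, minlength=vocab_size).astype(np.float64)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

class WeightSynthesisSketch:
    """Hypothetical hypernetwork: entropy -> bottleneck -> scale and shift."""

    def __init__(self, d_model, bottleneck_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.standard_normal((1, bottleneck_dim)) * 0.1
        self.b_in = np.zeros(bottleneck_dim)
        # Zero-initialized output so the modulation starts as the identity.
        self.w_out = np.zeros((bottleneck_dim, 2 * d_model))
        self.b_out = np.zeros(2 * d_model)

    def num_params(self):
        return sum(p.size for p in (self.w_in, self.b_in, self.w_out, self.b_out))

    def modulate(self, h, entropy):
        """Rescale and shift hidden states h using the entropy feature."""
        feat = np.tanh(np.array([[entropy]]) @ self.w_in + self.b_in)
        scale, shift = np.split((feat @ self.w_out + self.b_out)[0], 2)
        return h * (1.0 + scale) + shift

wse = WeightSynthesisSketch(d_model=32)
tokens = np.array([0, 1, 2, 3, 0, 1, 2, 3])
h = np.ones((3, 32))
out = wse.modulate(h, token_entropy(tokens, vocab_size=4))
print(wse.num_params(), out.shape)
```

A single scalar such as entropy is a cheap conditioning signal: inputs with different statistics (for example code versus prose) would yield different entropy values and hence different modulation once `w_out` is trained.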