PR #1542 (open)

Submission: Recursive Layer Sharing | 13.9 MB | 1.53 BPB

by negrurv
val_bpb: 1.5363
Architecture: Transformer
Optimizer:
Artifact Size: 13.9 MB

Training Techniques

Architecture
depth recurrence
Removed the standard encoder/decoder split and recursively passed data through the same Block of weights 9 times during the forward pass.
parameters: {"layers":1,"repeats":9}
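A minimal sketch of this depth recurrence, assuming a PyTorch-style module. `Block` is a stand-in for the submission's transformer block, and `RecursiveTransformer` is an illustrative name, not one taken from the PR:

```python
import torch
import torch.nn as nn


class RecursiveTransformer(nn.Module):
    """Depth recurrence: one shared block applied `repeats` times.

    Matches {"layers": 1, "repeats": 9}: a single set of block weights,
    reused on every pass of the forward loop.
    """

    def __init__(self, block: nn.Module, repeats: int = 9):
        super().__init__()
        self.block = block      # the single shared Block ("layers": 1)
        self.repeats = repeats  # number of recursive passes ("repeats": 9)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.repeats):
            x = self.block(x)   # same weights reused each iteration
        return x
```

Because the block is stored once and only re-invoked, the parameter count is that of a single layer regardless of how many passes the forward loop makes.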
weight tying
Single shared block reused recursively across multiple passes.
parameters: null
RMSNorm
Used a weightless RMSNorm variant (RMSNormNoWeight) to reduce parameter count.
parameters: null
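The weightless variant can be sketched as below. The class name `RMSNormNoWeight` comes from the PR; the body is an assumed implementation of standard RMSNorm with the learnable gain vector removed:

```python
import torch
import torch.nn as nn


class RMSNormNoWeight(nn.Module):
    """RMSNorm without the learnable per-channel gain.

    Normalizes each vector by its root mean square only, so the module
    contributes zero parameters to the model.
    """

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps  # no nn.Parameter registered -> parameter-free

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), computed via rsqrt for stability
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms
```

Dropping the gain trades a small amount of expressivity for a strictly smaller artifact, which fits the submission's parameter-count goal.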

Novel Contributions

  • Recursive Transformer with a single shared Block reused 9 times
  • Removal of the standard encoder/decoder split
  • Weightless RMSNorm variant to reduce parameters