val_bpb
1.5363
Architecture
Transformer
Optimizer
—
Artifact Size
13.9 MB
Training Techniques
Architecture
depth recurrence
Removed the standard encoder/decoder split; the forward pass instead applies the same Block of weights 9 times in sequence, feeding each pass's output into the next.
parameters: {"layers":1,"repeats":9}
weight tying
A single shared Block is reused for all 9 passes instead of stacking distinct layers.
parameters: null
RMSNorm
Used a weightless RMSNorm variant (RMSNormNoWeight) to reduce parameter count.
parameters: null
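A minimal sketch of what a weightless RMSNorm computes, written in plain Python for illustration. The function name `rms_norm_no_weight` and the `eps` value are assumptions, not taken from the submission; the actual RMSNormNoWeight module may differ in details such as dtype handling.

```python
import math

def rms_norm_no_weight(x, eps=1e-6):
    """Scale a vector by the inverse of its root mean square.

    Unlike a standard RMSNorm, there is no learnable per-feature gain,
    so the layer contributes zero parameters. `eps` is an assumed value.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]

# After normalization the mean of the squared entries is approximately 1.
normalized = rms_norm_no_weight([3.0, 4.0])
```

A standard RMSNorm over a model width of d stores d gain values per norm layer; the weightless variant stores none, which is where the parameter saving comes from.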
Novel Contributions
- Recursive Transformer with a single shared Block reused 9 times
- Removal of the standard encoder/decoder split
- Weightless RMSNorm variant to reduce parameters
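The shared-Block recurrence listed above can be sketched as follows. This is a toy illustration under stated assumptions: `Block` here is a hypothetical stand-in (a residual linear map), not the submission's Block, which would contain attention, an MLP, and normalization. Only the `layers = 1`, `repeats = 9` structure comes from the record.

```python
import random

class Block:
    """Hypothetical toy block: a residual linear map x -> x + W x."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(dim)]

    def num_params(self):
        return sum(len(row) for row in self.weight)

    def __call__(self, x):
        # Residual update: each output entry is x_i plus row_i dotted with x.
        return [xi + sum(w * xj for w, xj in zip(row, x))
                for xi, row in zip(x, self.weight)]

class RecursiveModel:
    """One Block (layers = 1) applied `repeats` times in the forward pass."""

    def __init__(self, dim, repeats=9):
        self.block = Block(dim)  # the only set of weights in the model
        self.repeats = repeats

    def forward(self, x):
        for _ in range(self.repeats):
            x = self.block(x)  # same weights on every pass
        return x

    def num_params(self):
        # Independent of `repeats`, because the weights are shared.
        return self.block.num_params()

model = RecursiveModel(dim=4, repeats=9)
out = model.forward([1.0, 0.0, 0.0, 0.0])
```

The parameter count depends only on the single Block, so raising `repeats` adds compute per forward pass without growing the artifact.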