PR #386
Depth-recurrent transformer: shared block × 12 passes, val_bpb 1.4061, 4.39MB artifact
by Sambhav242005
val_bpb: 1.4061
Architecture: Transformer
Optimizer: —
Artifact Size: 4.39MB
Training Techniques
Architecture
depth recurrence
A single shared transformer block is applied repeatedly across 12 passes instead of using independent layers.
parameters: {"passes":12,"shared_blocks":1}
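A minimal sketch of the recurrence, assuming a standard PyTorch transformer block; the class and argument names (`SharedDepthTransformer`, `passes`) are illustrative, not taken from the PR:

```python
import torch
import torch.nn as nn

class SharedDepthTransformer(nn.Module):
    """Applies one shared transformer block for `passes` iterations.

    Illustrative sketch: the PR's actual block internals may differ.
    """
    def __init__(self, block: nn.Module, passes: int = 12):
        super().__init__()
        self.block = block    # single set of weights, reused every pass
        self.passes = passes  # depth comes from iteration, not layer count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.passes):
            x = self.block(x)  # same parameters applied at every depth
        return x

# Example: 12 passes of one encoder layer stand in for a 12-layer stack.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = SharedDepthTransformer(layer, passes=12)
out = model(torch.randn(2, 128, 768))
```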
U-Net skip connections
Encoder passes store activations and decoder passes consume them in reverse order (see the combined sketch after the x0 residual mix item).
parameters: {"passes":12}
x0 residual mix
The original embedding is injected at every pass for stability.
parameters: null
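A sketch combining the recurrence loop with the U-Net skips and the x0 injection, assuming an even encoder/decoder split and learnable scalar mixing weights (the split point and both weights are assumptions, not values from the PR):

```python
import torch
import torch.nn as nn

class RecurrentUNet(nn.Module):
    """Shared block with U-Net skips across passes and x0 re-injection."""
    def __init__(self, block: nn.Module, passes: int = 12):
        super().__init__()
        assert passes % 2 == 0, "assumes equal encoder/decoder halves"
        self.block = block
        self.passes = passes
        # Assumed: learnable scalar mixing weights (the PR may use fixed
        # constants or per-pass projections instead).
        self.skip_weight = nn.Parameter(torch.tensor(1.0))
        self.x0_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0 = x        # original embedding, re-injected at every pass
        skips = []    # stack: encoder pushes, decoder pops (reverse order)
        half = self.passes // 2
        for i in range(self.passes):
            if i < half:
                skips.append(x)                          # encoder pass
            else:
                x = x + self.skip_weight * skips.pop()   # decoder pass
            x = x + self.x0_weight * x0                  # x0 residual mix
            x = self.block(x)
        return x
```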
KV head count
Uses a wider transformer configuration with more heads and fewer KV heads than the baseline.
parameters: {"num_heads":12,"num_kv_heads":4,"model_dim":768}
Compression
zlib
level: null
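The int8-plus-zlib roundtrip mentioned in the contributions below could look like this minimal sketch, assuming symmetric per-tensor quantization (the scale handling is an assumption, and the PR lists the zlib level as null, so the default is used here):

```python
import zlib
import numpy as np

def compress_tensor(w: np.ndarray) -> tuple[bytes, float]:
    """Symmetric per-tensor int8 quantization, then zlib. Illustrative only."""
    scale = float(np.abs(w).max()) / 127.0 or 1.0  # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale

def decompress_tensor(blob: bytes, scale: float, shape: tuple[int, ...]) -> np.ndarray:
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

# Roundtrip check: reconstruction error stays within one quantization step.
w = np.random.randn(768, 768).astype(np.float32)
blob, scale = compress_tensor(w)
w_hat = decompress_tensor(blob, scale, w.shape)
assert np.max(np.abs(w - w_hat)) <= scale
```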
Novel Contributions
- Single shared transformer block reused across 12 passes to reduce unique parameters.
- U-Net style skip connections between encoder and decoder passes.
- Residual injection of the original embedding at every pass for stability.
- Wider model enabled by parameter savings from depth recurrence.
- Int8 plus zlib roundtrip used for the final artifact.