PR #386

open

Depth-recurrent transformer: shared block × 12 passes, val_bpb 1.4061, 4.39MB artifact

by Sambhav242005
val_bpb: 1.4061
Architecture: Transformer
Optimizer:
Artifact Size: 4.39MB

Training Techniques

Architecture
depth recurrence
A single shared transformer block is applied sequentially for 12 passes instead of stacking 12 independent layers.
parameters: {"passes":12,"shared_blocks":1}
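A minimal sketch of the recurrence, assuming a placeholder `shared_block` (the residual-plus-tanh body stands in for a real attention + MLP block; names and the body are illustrative, not the PR's code):

```python
import numpy as np

def shared_block(x, W):
    # placeholder for a transformer block: residual + nonlinearity
    return x + np.tanh(x @ W)

def depth_recurrent_forward(x, W, passes=12):
    # the SAME weights W are applied on every pass, so the unique
    # parameter count stays that of a single block
    for _ in range(passes):
        x = shared_block(x, W)
    return x
```

With `shared_blocks = 1` and `passes = 12`, depth scales with compute while parameters stay fixed, which is the parameter saving the PR reinvests in width.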
U-Net skip connections
Encoder passes store activations and decoder passes consume them in reverse order.
parameters: {"passes":12}
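The encoder/decoder pairing can be sketched as follows, again with a placeholder block (the real skips would pair the shared block's encoder-pass activations with the decoder passes):

```python
import numpy as np

def unet_recurrent_forward(x, W, passes=12):
    # placeholder block: residual + tanh stands in for attention + MLP
    block = lambda h: h + np.tanh(h @ W)
    half = passes // 2
    stored = []
    # encoder passes: run the shared block and store each activation
    for _ in range(half):
        x = block(x)
        stored.append(x)
    # decoder passes: consume stored activations in reverse order
    for _ in range(half):
        x = x + stored.pop()   # U-Net style skip connection
        x = block(x)
    return x
```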
x0 residual mix
The original embedding is injected at every pass for stability.
parameters: null
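A sketch of the injection, assuming a simple convex mix (the coefficient `alpha` is hypothetical, as the PR lists no parameters for this technique):

```python
import numpy as np

def x0_mix_forward(x0, W, passes=12, alpha=0.1):
    # alpha is a hypothetical mixing coefficient (not given in the PR);
    # re-injecting the original embedding x0 every pass keeps the
    # recurrent state anchored to the input
    x = x0
    for _ in range(passes):
        x = x + np.tanh(x @ W)               # shared block (placeholder)
        x = (1.0 - alpha) * x + alpha * x0   # x0 residual mix
    return x
```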
KV head count
Uses a wider transformer configuration with more heads and fewer KV heads than the baseline.
parameters: {"num_heads":12,"num_kv_heads":4,"model_dim":768}
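With `num_heads = 12` and `num_kv_heads = 4`, each KV head serves three query heads (grouped-query attention), shrinking KV projections and cache. A single-sequence sketch, with shapes assumed for illustration (`model_dim = 768` over 12 heads gives `head_dim = 64`):

```python
import numpy as np

def grouped_query_attention(q, k, v, num_heads=12, num_kv_heads=4):
    # q: (T, num_heads, head_dim); k, v: (T, num_kv_heads, head_dim)
    group = num_heads // num_kv_heads   # 3 query heads per KV head
    k = np.repeat(k, group, axis=1)     # expand KV heads to match queries
    v = np.repeat(v, group, axis=1)
    head_dim = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return np.einsum('hqk,khd->qhd', weights, v)
```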
Compression
zlib
level: null
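A sketch of what the int8 plus zlib roundtrip behind the 4.39MB artifact might look like. Per-tensor symmetric quantization is an assumption (the PR does not specify the scheme), and `level: null` is read as zlib's default level:

```python
import zlib
import numpy as np

def compress_tensor(w: np.ndarray):
    # symmetric per-tensor int8 quantization (assumed scheme),
    # followed by zlib at its default level
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale

def decompress_tensor(blob, scale, shape):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale
```

The roundtrip error is bounded by half a quantization step (`scale / 2`) per weight.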

Novel Contributions

  • Single shared transformer block reused across 12 passes to reduce unique parameters.
  • U-Net style skip connections between encoder and decoder passes.
  • Residual injection of the original embedding at every pass for stability.
  • Wider model enabled by parameter savings from depth recurrence.
  • Int8 plus zlib roundtrip used for the final artifact.