val_bpb: 1.2663
Architecture: Depth-recurrent Transformer
Optimizer: —
Artifact Size: 13.9 MB
Training Techniques
Architecture
depth recurrence
Five unique transformer blocks are looped 3 times each, giving 15 effective layers and trading unique parameters for effective depth.
parameters: {"layers":15,"unique_blocks":5,"loops":3,"dim":768}
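The layer-sharing schedule can be sketched in a few lines. Whether each block repeats consecutively or the full 5-block stack is traversed 3 times is not specified here, so the stack-traversal variant below is an assumption (the alternative is noted in the docstring):

```python
UNIQUE_BLOCKS = 5   # distinct parameter sets
LOOPS = 3           # times the stack is traversed
EFFECTIVE_LAYERS = UNIQUE_BLOCKS * LOOPS  # 15 virtual layers

def block_for_layer(layer: int) -> int:
    """Map a virtual layer index (0..14) to the unique block it reuses.

    Assumed schedule: the full 5-block stack is traversed LOOPS times,
    i.e. [0,1,2,3,4, 0,1,2,3,4, 0,1,2,3,4]. A per-block repeat
    ([0,0,0, 1,1,1, ...]) would be `layer // LOOPS` instead.
    """
    return layer % UNIQUE_BLOCKS

schedule = [block_for_layer(i) for i in range(EFFECTIVE_LAYERS)]
```

Either schedule yields 15 virtual layers backed by only 5 parameter sets.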
KV head count
Grouped-query attention with 12 query heads and 6 key/value heads.
parameters: {"heads":12,"kv_heads":6}
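A minimal sketch of the manual KV-repeat mentioned under Novel Contributions, assuming the usual (batch, heads, seq, head_dim) tensor layout; shapes and names here are illustrative, not taken from the actual implementation:

```python
import torch
import torch.nn.functional as F

B, T = 2, 8
N_HEAD, N_KV_HEAD, HEAD_DIM = 12, 6, 64
n_rep = N_HEAD // N_KV_HEAD  # 2 query heads share each KV head

q = torch.randn(B, N_HEAD, T, HEAD_DIM)
k = torch.randn(B, N_KV_HEAD, T, HEAD_DIM)
v = torch.randn(B, N_KV_HEAD, T, HEAD_DIM)

# PyTorch 2.4's scaled_dot_product_attention has no native GQA path,
# so KV heads are expanded by hand to match the query head count.
k_rep = k.repeat_interleave(n_rep, dim=1)
v_rep = v.repeat_interleave(n_rep, dim=1)

out = F.scaled_dot_product_attention(q, k_rep, v_rep)
```

With `repeat_interleave`, query heads 2i and 2i+1 both see KV head i, which is the standard GQA grouping.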
tied embeddings
Input and output embeddings are tied.
parameters: null
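Weight tying is a one-line sharing of the embedding matrix with the output projection. A minimal sketch, with the 768 dim taken from the architecture above and an illustrative vocab size (class and attribute names are hypothetical):

```python
import torch.nn as nn

VOCAB, DIM = 50304, 768  # DIM from the architecture above; VOCAB illustrative

class TiedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.lm_head = nn.Linear(DIM, VOCAB, bias=False)
        # Tie: the output projection reuses the input embedding matrix,
        # removing VOCAB * DIM parameters from the budget.
        self.lm_head.weight = self.tok_emb.weight

m = TiedLM()
```

Both modules now point at the same `Parameter`, so the matrix is counted (and updated) once.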
U-Net skip connections
Skip connections are used across virtual layers in the recurrent depth setup.
parameters: null
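The exact wiring of the U-Net skips is not documented here; a common choice, assumed in this sketch, is a mirror pairing in which the output of an early virtual layer is added to the input of its mirror layer in the second half of the stack:

```python
EFFECTIVE_LAYERS = 15  # virtual layers from the recurrent-depth setup

def skip_pairs(n_layers: int) -> list[tuple[int, int]]:
    """Assumed U-Net pairing: early layer i feeds its mirror n-1-i.

    With 15 virtual layers this gives 7 skips; the middle layer (7)
    has no partner.
    """
    return [(i, n_layers - 1 - i) for i in range(n_layers // 2)]

pairs = skip_pairs(EFFECTIVE_LAYERS)
```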
Compression
custom
level: null
Test-Time Training
full TTT
parameters: null
Quantization
QAT
bits: null
scope: post-quantization gap
Novel Contributions
- Depth-recurrent transformer with 5 shared blocks looped 3x for 15 effective layers
- Reallocation of parameter budget from depth to width (model dim 768 vs. a 512-dim baseline)
- Grouped-query attention with 12 query heads and 6 KV heads
- Tied embeddings
- U-Net style skip connections across virtual layers
- Manual GQA KV-repeat for PyTorch 2.4 compatibility
- Exploration of tokenizer optimization, width/depth sweep, test-time training, and QAT as next steps