PR #30

closed

Non-record: Depth-recurrent 5x3 d768, val_bpb=1.2663

by JackYoung27
val_bpb: 1.2663
Architecture: Depth-recurrent Transformer
Optimizer:
Artifact Size: 13.9 MB

Training Techniques

Architecture
depth recurrence
Five unique transformer blocks are looped three times each for 15 effective layers, trading unique parameters for effective depth.
parameters: {"layers":15,"unique_blocks":5,"loops":3,"dim":768}
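The 5×3 looping scheme can be sketched as below. This is an illustrative reconstruction, not the PR's code: the class name is invented and `nn.TransformerEncoderLayer` stands in for the submission's actual blocks (which use GQA, tied embeddings, and skip connections).

```python
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """5 unique blocks applied in a loop 3 times -> 15 effective layers."""
    def __init__(self, dim=768, unique_blocks=5, loops=3):
        super().__init__()
        # Parameters exist only for the 5 unique blocks; the extra
        # depth comes from weight reuse, not from new parameters.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
            for _ in range(unique_blocks)
        )
        self.loops = loops

    def forward(self, x):
        for _ in range(self.loops):    # 3 passes over the same weights
            for block in self.blocks:  # 5 unique blocks per pass
                x = block(x)
        return x
```

Compared with 15 distinct layers, this keeps the artifact small (only 5 blocks' worth of weights) while preserving effective depth at inference time.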
KV head count
Grouped-query attention with 12 query heads and 6 key/value heads.
parameters: {"heads":12,"kv_heads":6}
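The 12-query/6-KV grouping, including the manual KV-repeat mentioned under Novel Contributions, might look like the following sketch. The function name and tensor layout are illustrative; the PR's own implementation is not shown here.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads=12, n_kv_heads=6):
    """Grouped-query attention: 12 query heads share 6 KV heads.

    q: (B, n_heads, T, hd); k, v: (B, n_kv_heads, T, hd).
    PyTorch 2.4's SDPA has no built-in GQA support, so the KV
    heads are repeated manually to match the query head count.
    """
    rep = n_heads // n_kv_heads          # 2 query heads per KV head
    k = k.repeat_interleave(rep, dim=1)  # (B, n_heads, T, hd)
    v = v.repeat_interleave(rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Halving the KV heads halves the KV cache while leaving query expressivity at 12 heads; the repeat is a view-level expansion whose cost is small relative to the attention itself.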
tied embeddings
Input and output embeddings are tied.
parameters: null
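Weight tying amounts to sharing one (vocab, dim) matrix between the input embedding and the output projection. A minimal sketch, with illustrative sizes (the PR's vocabulary size is not stated):

```python
import torch.nn as nn

vocab, dim = 50_000, 768                      # illustrative sizes
embed = nn.Embedding(vocab, dim)              # token -> vector lookup
lm_head = nn.Linear(dim, vocab, bias=False)   # vector -> logits
lm_head.weight = embed.weight                 # one shared parameter tensor
```

This removes an entire vocab × dim matrix from the parameter count, which matters for a 13.9 MB artifact where embeddings would otherwise dominate.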
U-Net skip connections
Skip connections are used across virtual layers in the recurrent depth setup.
parameters: null
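One way to run U-Net skips over the 15 virtual layers is to stash activations from the first half of the loop and add them back symmetrically in the second half. This is an assumed pairing scheme, the PR does not specify one, and `nn.Linear` stands in for the real transformer blocks:

```python
import torch
import torch.nn as nn

class RecurrentUNetStack(nn.Module):
    """Depth-recurrent stack with U-Net skips across virtual layers:
    the input of virtual layer i (first half) is added back before
    its mirror layer in the second half."""
    def __init__(self, dim=768, unique_blocks=5, loops=3):
        super().__init__()
        # nn.Linear is a stand-in for the shared transformer blocks.
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(unique_blocks))
        self.loops = loops

    def forward(self, x):
        n = len(self.blocks) * self.loops  # 15 virtual layers
        saved = []
        for i in range(n):
            if i < n // 2:
                saved.append(x)            # first half: stash activations
            elif saved:
                x = x + saved.pop()        # second half: symmetric skip
            x = self.blocks[i % len(self.blocks)](x)
        return x
```

The skips give later virtual layers a direct path to early activations, which helps when the same weights are reused many times and residual signal would otherwise wash out.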
Compression
custom
level: null
Test-Time Training
full TTT
parameters: null
Quantization
QAT
bits: null
scope: post-quantization gap
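Addressing the post-quantization gap with QAT typically means fake-quantizing weights in the forward pass while passing gradients straight through. The sketch below is a generic pattern, not the PR's code; the bit width is illustrative since the submission leaves it unspecified.

```python
import torch

def fake_quant(w, bits=8):
    """Quantize-dequantize in the forward pass; identity in the backward pass.

    bits=8 is an assumed value -- the PR does not state its bit width.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: forward sees q, gradients see identity.
    return w + (q - w).detach()
```

Training against the quantized forward pass lets the weights adapt to the rounding grid, shrinking the gap between full-precision and quantized validation loss.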

Novel Contributions

  • Depth-recurrent transformer with 5 shared blocks looped 3x for 15 effective layers
  • Reallocation of parameter budget from depth to width (768 vs baseline 512)
  • Grouped-query attention with 12 query heads and 6 KV heads
  • Tied embeddings
  • U-Net style skip connections across virtual layers
  • Manual GQA KV-repeat for PyTorch 2.4 compatibility
  • Exploration of tokenizer optimization, width/depth sweep, test-time training, and QAT as next steps