PR #30

closed

Non-record: Depth-recurrent 5x3 d768, val_bpb=1.2663

by JackYoung27
val_bpb: 1.2663
Architecture: Depth-recurrent Transformer
Optimizer:
Artifact Size: 13.9 MB

Training Techniques

Architecture
depth recurrence
Five unique transformer blocks are looped three times each for 15 effective layers, trading unique parameters for effective depth.
parameters: {"layers":15,"unique_blocks":5,"loops":3,"dim":768}
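The 5×3 looping scheme can be sketched as below. This is an illustrative reconstruction, not the PR's code: the class name is invented and `nn.TransformerEncoderLayer` stands in for the submission's actual blocks (which use GQA, tied embeddings, and skip connections).

```python
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    """5 unique blocks applied in a loop 3 times -> 15 effective layers."""
    def __init__(self, dim=768, unique_blocks=5, loops=3):
        super().__init__()
        # Parameters exist only for the 5 unique blocks; the extra
        # depth comes from weight reuse, not from new parameters.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
            for _ in range(unique_blocks)
        )
        self.loops = loops

    def forward(self, x):
        for _ in range(self.loops):    # 3 passes over the same weights
            for block in self.blocks:  # 5 unique blocks per pass
                x = block(x)
        return x
```

Compared with 15 distinct layers, this keeps the artifact small (only 5 blocks' worth of weights) while preserving effective depth at inference time.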
KV head count
Grouped-query attention with 12 query heads and 6 key/value heads.
parameters: {"heads":12,"kv_heads":6}
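The 12-query/6-KV grouping, including the manual KV-repeat mentioned under Novel Contributions, might look like the following sketch. The function name and tensor layout are illustrative; the PR's own implementation is not shown here.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads=12, n_kv_heads=6):
    """Grouped-query attention: 12 query heads share 6 KV heads.

    q: (B, n_heads, T, hd); k, v: (B, n_kv_heads, T, hd).
    PyTorch 2.4's SDPA has no built-in GQA support, so the KV
    heads are repeated manually to match the query head count.
    """
    rep = n_heads // n_kv_heads          # 2 query heads per KV head
    k = k.repeat_interleave(rep, dim=1)  # (B, n_heads, T, hd)
    v = v.repeat_interleave(rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Halving the KV heads halves the KV cache while leaving query expressivity at 12 heads; the repeat is a view-level expansion whose cost is small relative to the attention itself.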
tied embeddings
Input and output embeddings are tied.
parameters: null
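Weight tying amounts to sharing one (vocab, dim) matrix between the input embedding and the output projection. A minimal sketch, with illustrative sizes (the PR's vocabulary size is not stated):

```python
import torch.nn as nn

vocab, dim = 50_000, 768                      # illustrative sizes
embed = nn.Embedding(vocab, dim)              # token -> vector lookup
lm_head = nn.Linear(dim, vocab, bias=False)   # vector -> logits
lm_head.weight = embed.weight                 # one shared parameter tensor
```

This removes an entire vocab × dim matrix from the parameter count, which matters for a 13.9 MB artifact where embeddings would otherwise dominate.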
U-Net skip connections
Skip connections are used across virtual layers in the recurrent depth setup.
parameters: null
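One way to run U-Net skips over the 15 virtual layers is to stash activations from the first half of the loop and add them back symmetrically in the second half. This is an assumed pairing scheme, the PR does not specify one, and `nn.Linear` stands in for the real transformer blocks:

```python
import torch
import torch.nn as nn

class RecurrentUNetStack(nn.Module):
    """Depth-recurrent stack with U-Net skips across virtual layers:
    the input of virtual layer i (first half) is added back before
    its mirror layer in the second half."""
    def __init__(self, dim=768, unique_blocks=5, loops=3):
        super().__init__()
        # nn.Linear is a stand-in for the shared transformer blocks.
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(unique_blocks))
        self.loops = loops

    def forward(self, x):
        n = len(self.blocks) * self.loops  # 15 virtual layers
        saved = []
        for i in range(n):
            if i < n // 2:
                saved.append(x)            # first half: stash activations
            elif saved:
                x = x + saved.pop()        # second half: symmetric skip
            x = self.blocks[i % len(self.blocks)](x)
        return x
```

The skips give later virtual layers a direct path to early activations, which helps when the same weights are reused many times and residual signal would otherwise wash out.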
Compression
custom
level: null
Test-Time Training
full TTT
parameters: null
Quantization
QAT
bits: null
scope: post-quantization gap
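Addressing the post-quantization gap with QAT typically means fake-quantizing weights in the forward pass while passing gradients straight through. The sketch below is a generic pattern, not the PR's code; the bit width is illustrative since the submission leaves it unspecified.

```python
import torch

def fake_quant(w, bits=8):
    """Quantize-dequantize in the forward pass; identity in the backward pass.

    bits=8 is an assumed value -- the PR does not state its bit width.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: forward sees q, gradients see identity.
    return w + (q - w).detach()
```

Training against the quantized forward pass lets the weights adapt to the rounding grid, shrinking the gap between full-precision and quantized validation loss.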

Novel Contributions

  • Depth-recurrent transformer with 5 shared blocks looped 3x for 15 effective layers
  • Reallocation of parameter budget from depth to width (768 vs baseline 512)
  • Grouped-query attention with 12 query heads and 6 KV heads
  • Tied embeddings
  • U-Net style skip connections across virtual layers
  • Manual GQA KV-repeat for PyTorch 2.4 compatibility
  • Exploration of tokenizer optimization, width/depth sweep, test-time training, and QAT as next steps