PR #31

closed

Non-record: Depth-recurrent 5x3 d768, val_bpb=1.2663

by JackYoung27
val_bpb: 1.2663
Architecture: Depth-recurrent Transformer
Optimizer: (not specified)
Artifact Size: 13.9MB

Training Techniques

Architecture
depth recurrence
Five shared transformer blocks are looped 3 times, giving 15 effective layers at model dimension 768.
parameters: {"shared_blocks":5,"loops":3,"effective_layers":15,"dimension":768}
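The weight-sharing scheme above can be sketched as a simple nested loop: the same 5 blocks are reapplied 3 times, so parameter count stays at 5 blocks' worth while compute depth is 15. This is a minimal NumPy sketch with toy residual blocks standing in for transformer blocks, not the submission's actual code.

```python
import numpy as np

def depth_recurrent_forward(x, blocks, loops):
    # Apply the same stack of shared blocks `loops` times:
    # 5 shared blocks x 3 loops = 15 effective layers.
    for _ in range(loops):
        for block in blocks:
            x = block(x)
    return x

# Toy "block": a residual linear layer standing in for a full
# transformer block (attention + MLP) at dimension 768.
rng = np.random.default_rng(0)
d = 768
blocks = [
    (lambda x, W=rng.standard_normal((d, d)) * 0.01: x + x @ W)
    for _ in range(5)
]

x = rng.standard_normal((4, d))
y = depth_recurrent_forward(x, blocks, loops=3)
```

Note that only the 5 blocks' parameters exist; looping adds depth but no new weights, which is what keeps the artifact small.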
GQA
Grouped-query attention with 12 query heads and 6 key/value heads.
parameters: {"query_heads":12,"kv_heads":6}
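With 12 query heads and 6 K/V heads at dimension 768 (head_dim 64), each K/V head serves 12 // 6 = 2 query heads, halving the K/V projection and cache size. A minimal NumPy sketch of this grouping, assuming standard softmax attention (illustrative only, not the submission's implementation):

```python
import numpy as np

def gqa_attention(x, Wq, Wk, Wv, n_q_heads=12, n_kv_heads=6):
    # Grouped-query attention: n_q_heads query heads share
    # n_kv_heads K/V heads (here 12 queries over 6 K/V heads).
    T, d = x.shape
    head_dim = d // n_q_heads          # 768 // 12 = 64
    group = n_q_heads // n_kv_heads    # 2 query heads per K/V head
    q = (x @ Wq).reshape(T, n_q_heads, head_dim)
    k = (x @ Wk).reshape(T, n_kv_heads, head_dim)
    v = (x @ Wv).reshape(T, n_kv_heads, head_dim)
    # Repeat each K/V head so every query head has a matching one.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    for h in range(n_q_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(head_dim)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, h]
    return out.reshape(T, d)

rng = np.random.default_rng(1)
d, kv_dim = 768, 6 * 64   # K/V projection is half the query width
Wq = rng.standard_normal((d, d)) * 0.01
Wk = rng.standard_normal((d, kv_dim)) * 0.01
Wv = rng.standard_normal((d, kv_dim)) * 0.01
x = rng.standard_normal((8, d))
y = gqa_attention(x, Wq, Wk, Wv)
```

The K and V projections map 768 to 384 instead of 768, which is where the parameter and cache savings over full multi-head attention come from.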
Other
other
Model was still improving at cutoff and had not plateaued.
parameters: {"steps":2651,"wallclock":"10min","hardware":"4xH100 SXM"}
other
Planned future work mentioned: tokenizer optimization, width/depth sweep, test-time training, and QAT.
parameters: {"tokenizer_optimization":"sp4096","width_depth_sweep":true,"test_time_training":true,"qat":true}

Novel Contributions

  • Depth-recurrent transformer with 5 shared blocks looped 3 times
  • 15 effective layers at dimension 768
  • Grouped-query attention with 12:6 head configuration
  • Reported 21.4M parameters and approximately 13.9MB compressed artifact size