PR #31

closed

Non-record: Depth-recurrent 5x3 d768, val_bpb=1.2663

by JackYoung27
val_bpb: 1.2663
Architecture: Depth-recurrent Transformer
Optimizer: (not specified)
Artifact Size: 13.9MB

Training Techniques

Architecture
depth recurrence
Five shared transformer blocks are looped 3 times, giving 15 effective layers at model dimension 768.
parameters: {"shared_blocks":5,"loops":3,"effective_layers":15,"dimension":768}
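The weight-sharing scheme above can be sketched as a simple nested loop: the same 5 blocks are reapplied 3 times, so parameter count stays at 5 blocks' worth while compute depth is 15. This is a minimal NumPy sketch with toy residual blocks standing in for transformer blocks, not the submission's actual code.

```python
import numpy as np

def depth_recurrent_forward(x, blocks, loops):
    # Apply the same stack of shared blocks `loops` times:
    # 5 shared blocks x 3 loops = 15 effective layers.
    for _ in range(loops):
        for block in blocks:
            x = block(x)
    return x

# Toy "block": a residual linear layer standing in for a full
# transformer block (attention + MLP) at dimension 768.
rng = np.random.default_rng(0)
d = 768
blocks = [
    (lambda x, W=rng.standard_normal((d, d)) * 0.01: x + x @ W)
    for _ in range(5)
]

x = rng.standard_normal((4, d))
y = depth_recurrent_forward(x, blocks, loops=3)
```

Note that only the 5 blocks' parameters exist; looping adds depth but no new weights, which is what keeps the artifact small.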
GQA
Grouped-query attention with 12 query heads and 6 key/value heads.
parameters: {"query_heads":12,"kv_heads":6}
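With 12 query heads and 6 K/V heads at dimension 768 (head_dim 64), each K/V head serves 12 // 6 = 2 query heads, halving the K/V projection and cache size. A minimal NumPy sketch of this grouping, assuming standard softmax attention (illustrative only, not the submission's implementation):

```python
import numpy as np

def gqa_attention(x, Wq, Wk, Wv, n_q_heads=12, n_kv_heads=6):
    # Grouped-query attention: n_q_heads query heads share
    # n_kv_heads K/V heads (here 12 queries over 6 K/V heads).
    T, d = x.shape
    head_dim = d // n_q_heads          # 768 // 12 = 64
    group = n_q_heads // n_kv_heads    # 2 query heads per K/V head
    q = (x @ Wq).reshape(T, n_q_heads, head_dim)
    k = (x @ Wk).reshape(T, n_kv_heads, head_dim)
    v = (x @ Wv).reshape(T, n_kv_heads, head_dim)
    # Repeat each K/V head so every query head has a matching one.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    for h in range(n_q_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(head_dim)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, h]
    return out.reshape(T, d)

rng = np.random.default_rng(1)
d, kv_dim = 768, 6 * 64   # K/V projection is half the query width
Wq = rng.standard_normal((d, d)) * 0.01
Wk = rng.standard_normal((d, kv_dim)) * 0.01
Wv = rng.standard_normal((d, kv_dim)) * 0.01
x = rng.standard_normal((8, d))
y = gqa_attention(x, Wq, Wk, Wv)
```

The K and V projections map 768 to 384 instead of 768, which is where the parameter and cache savings over full multi-head attention come from.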
Other
other
Model was still improving at cutoff and had not plateaued.
parameters: {"steps":2651,"wallclock":"10min","hardware":"4xH100 SXM"}
other
Planned future work mentioned: tokenizer optimization, width/depth sweep, test-time training, and QAT.
parameters: {"tokenizer_optimization":"sp4096","width_depth_sweep":true,"test_time_training":true,"qat":true}

Novel Contributions

  • Depth-recurrent transformer with 5 shared blocks looped 3 times
  • 15 effective layers at dimension 768
  • Grouped-query attention with 12:6 head configuration
  • Reported 21.4M parameters and approximately 13.9MB compressed artifact size