- val_bpb: 1.2663
- Architecture: Depth-recurrent Transformer
- Optimizer: —
- Artifact Size: 13.9 MB
Training Techniques

Architecture

- Depth recurrence: 5 shared transformer blocks looped 3 times, for 15 effective layers at dimension 768.
  parameters: `{"shared_blocks":5,"loops":3,"effective_layers":15,"dimension":768}`
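The recurrence above can be sketched in a few lines. This is a hypothetical stand-in, not the submission's code: each "block" is a residual linear map in place of a full transformer block (attention + MLP), and the point is only the weight sharing — 5 parameter sets applied in a loop 3 times yield 15 effective layer applications.

```python
import numpy as np

DIM, SHARED_BLOCKS, LOOPS = 768, 5, 3
rng = np.random.default_rng(0)

# Only SHARED_BLOCKS weight matrices are stored, not 15.
weights = [rng.normal(scale=0.01, size=(DIM, DIM)) for _ in range(SHARED_BLOCKS)]

def forward(x, applied):
    for _ in range(LOOPS):               # outer recurrence over depth
        for i, w in enumerate(weights):  # the same 5 blocks on every pass
            x = x + x @ w                # residual stand-in for a block
            applied.append(i)            # record which block ran
    return x

applied = []
x = forward(rng.normal(size=(1, DIM)), applied)
print(len(applied))  # 15 effective layer applications
```

Parameter storage scales with `SHARED_BLOCKS` while compute scales with `SHARED_BLOCKS * LOOPS`, which is what keeps the artifact small relative to the effective depth.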
- GQA: grouped-query attention with 12 query heads and 6 key/value heads.
  parameters: `{"query_heads":12,"kv_heads":6}`
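With 12 query heads over 6 key/value heads, each KV head is shared by 2 query heads. A minimal numpy sketch of that grouping (illustrative only; head dimension 768/12 = 64 is inferred from the listed model dimension and head count):

```python
import numpy as np

DIM, Q_HEADS, KV_HEADS = 768, 12, 6
HEAD_DIM = DIM // Q_HEADS        # 64
GROUP = Q_HEADS // KV_HEADS      # 2 query heads per kv head

rng = np.random.default_rng(0)
T = 4                            # toy sequence length
q = rng.normal(size=(Q_HEADS, T, HEAD_DIM))
k = rng.normal(size=(KV_HEADS, T, HEAD_DIM))
v = rng.normal(size=(KV_HEADS, T, HEAD_DIM))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

out = np.empty_like(q)
for h in range(Q_HEADS):
    kv = h // GROUP              # query heads 0,1 -> kv head 0; 2,3 -> 1; ...
    scores = q[h] @ k[kv].T / np.sqrt(HEAD_DIM)
    out[h] = softmax(scores) @ v[kv]

print(out.shape)  # (12, 4, 64)
```

Halving the KV heads halves KV-cache size and KV projection parameters while keeping the full set of query heads.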
Other

- Training duration: the model was still improving at the cutoff and had not plateaued.
  parameters: `{"steps":2651,"wallclock":"10min","hardware":"4xH100 SXM"}`
- Planned future work: tokenizer optimization (sp4096), a width/depth sweep, test-time training, and QAT.
  parameters: `{"tokenizer_optimization":"sp4096","width_depth_sweep":true,"test_time_training":true,"qat":true}`
Novel Contributions
- Depth-recurrent transformer: 5 shared blocks looped 3 times, for 15 effective layers at dimension 768
- Grouped-query attention with a 12:6 query-to-key/value head configuration
- Reported 21.4M parameters and a compressed artifact size of approximately 13.9 MB
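The last two figures imply a sub-byte storage cost per parameter. A quick back-of-envelope check, assuming "MB" here means 10^6 bytes (the card does not say which convention is used):

```python
params = 21.4e6           # reported parameter count
artifact_bytes = 13.9e6   # 13.9 MB, assuming MB = 10^6 bytes

bytes_per_param = artifact_bytes / params
bits_per_param = bytes_per_param * 8
print(round(bytes_per_param, 2), round(bits_per_param, 1))  # 0.65 5.2
```

About 0.65 bytes (~5.2 bits) per parameter, well under the 2 bytes of fp16/bf16 storage, so the artifact is evidently quantized and/or compressed below half precision.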