PR #213

Status: open

Non-record submission: recurrent 512 L3 6k (8x H100, 224s)

by estesryan
val_bpb: 1.6004
Architecture: shared-loop recurrent transformer

Training Techniques

Architecture (depth recurrence)
Uses a shared-loop recurrent transformer whose looped layers reuse the same block multiple times (see the sketch after the parameters below).
parameters: {"model_dim":512,"num_loop_iters":3,"min_loop_iters":1}
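A minimal sketch of the depth-recurrence idea, assuming a standard pre-norm transformer block. Only model_dim=512, num_loop_iters=3, and min_loop_iters=1 come from the listed parameters; everything else (class names, head count, vocab size, and the variable-depth sampling) is an illustrative guess, not the submission's actual code.

```python
import random

import torch
import torch.nn as nn


class LoopedBlock(nn.Module):
    """One pre-norm transformer block whose weights are reused on every pass."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal self-attention followed by an MLP, both with residuals.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


class SharedLoopTransformer(nn.Module):
    """Runs one shared block num_loop_iters times instead of stacking
    num_loop_iters distinct layers, so extra depth costs no extra parameters."""

    def __init__(self, vocab_size: int = 256, dim: int = 512, max_len: int = 1024,
                 min_loop_iters: int = 1, num_loop_iters: int = 3):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Embedding(max_len, dim)
        self.block = LoopedBlock(dim)  # the single shared set of block weights
        self.min_loop_iters = min_loop_iters
        self.num_loop_iters = num_loop_iters
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_embed(tokens) + self.pos_embed(pos)
        # One plausible reading of min_loop_iters: sample a variable depth
        # during training and run the full depth at eval time. This is an
        # assumption; the submission may use the parameter differently.
        iters = (random.randint(self.min_loop_iters, self.num_loop_iters)
                 if self.training else self.num_loop_iters)
        for _ in range(iters):
            x = self.block(x)  # same weights reused on every iteration
        return self.head(self.norm(x))


# Looped 3x, the 512-dim model has the effective depth of a 3-layer stack
# while storing only one layer's parameters (plus embeddings and head).
model = SharedLoopTransformer()
logits = model(torch.randint(0, 256, (2, 1024)))  # train_length = 1024
print(logits.shape)  # torch.Size([2, 1024, 256])
```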
Sequence Length (sequence_length)
train_length: 1024
eval_length: 1024
Other
Non-record submission targeting the 10-minute 16MB track with a compact recurrent architecture and stable convergence.
parameters: {"iterations":6000,"hardware":"8x H100","runtime_seconds":224}
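For scale, 6,000 iterations in 224 seconds works out to roughly 27 optimizer steps per second on the 8x H100 node.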

Novel Contributions

  • Shared-loop recurrent transformer architecture
  • Compact 512-dimensional model for the 10-minute 16MB track
  • Stable convergence within the runtime constraint
  • Looped layers with recurrent depth sharing