PR #213

Status: open

Non-record submission: recurrent 512 L3 6k (8x H100, 224s)

by estesryan
val_bpb: 1.6004
Architecture: shared-loop recurrent transformer

Training Techniques

Architecture (depth recurrence)
Uses a shared-loop recurrent transformer whose looped layers reuse the same block multiple times (see the sketch after the parameters below).
parameters: {"model_dim":512,"num_loop_iters":3,"min_loop_iters":1}
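A minimal sketch of the depth-recurrence idea, assuming a standard pre-norm transformer block. Only model_dim=512, num_loop_iters=3, and min_loop_iters=1 come from the listed parameters; everything else (class names, head count, vocab size, and the variable-depth sampling) is an illustrative guess, not the submission's actual code.

```python
import random

import torch
import torch.nn as nn


class LoopedBlock(nn.Module):
    """One pre-norm transformer block whose weights are reused on every pass."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal self-attention followed by an MLP, both with residuals.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))


class SharedLoopTransformer(nn.Module):
    """Runs one shared block num_loop_iters times instead of stacking
    num_loop_iters distinct layers, so extra depth costs no extra parameters."""

    def __init__(self, vocab_size: int = 256, dim: int = 512, max_len: int = 1024,
                 min_loop_iters: int = 1, num_loop_iters: int = 3):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Embedding(max_len, dim)
        self.block = LoopedBlock(dim)  # the single shared set of block weights
        self.min_loop_iters = min_loop_iters
        self.num_loop_iters = num_loop_iters
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_embed(tokens) + self.pos_embed(pos)
        # One plausible reading of min_loop_iters: sample a variable depth
        # during training and run the full depth at eval time. This is an
        # assumption; the submission may use the parameter differently.
        iters = (random.randint(self.min_loop_iters, self.num_loop_iters)
                 if self.training else self.num_loop_iters)
        for _ in range(iters):
            x = self.block(x)  # same weights reused on every iteration
        return self.head(self.norm(x))


# Looped 3x, the 512-dim model has the effective depth of a 3-layer stack
# while storing only one layer's parameters (plus embeddings and head).
model = SharedLoopTransformer()
logits = model(torch.randint(0, 256, (2, 1024)))  # train_length = 1024
print(logits.shape)  # torch.Size([2, 1024, 256])
```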
Sequence Length (sequence_length)
train_length: 1024
eval_length: 1024
Other
Non-record submission targeting the 10-minute 16MB track with a compact recurrent architecture and stable convergence.
parameters: {"iterations":6000,"hardware":"8x H100","runtime_seconds":224}
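For scale, 6,000 iterations in 224 seconds works out to roughly 27 optimizer steps per second on the 8x H100 node.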

Novel Contributions

  • Shared-loop recurrent transformer architecture
  • Compact 512-dimensional model for the 10-minute 16MB track
  • Stable convergence within the runtime constraint
  • Looped layers with recurrent depth sharing