PR #648

closed

Depth Recurrence (3+3 x 2 loops) + HW Optimizations

by maorinka · View on GitHub
val_bpb
1.1428
Architecture
Transformer
Optimizer
Artifact Size
~10MB

Training Techniques

Architecture
depth recurrence
Replaces 10 unique transformer layers with 6 unique layers (3 encoder + 3 decoder) looped twice to create 12 effective layers while sharing parameters.
parameters: {"unique_layers":6,"encoder_layers":3,"decoder_layers":3,"num_loops":2,"effective_layers":12}
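A minimal sketch of the 3+3 x 2 recurrence schedule described above, with plain functions standing in for the PR's transformer blocks (block internals here are hypothetical placeholders):

```python
NUM_LOOPS = 2

def make_block(name):
    def block(x, trace):
        trace.append(name)   # record each effective layer application
        return x + 1         # placeholder for a real transformer block
    return block

# 6 unique blocks: 3 "encoder" + 3 "decoder"
encoder_blocks = [make_block(f"enc{i}") for i in range(3)]
decoder_blocks = [make_block(f"dec{i}") for i in range(3)]

def forward(x):
    trace = []
    for loop in range(NUM_LOOPS):      # same parameters reused each loop
        for blk in encoder_blocks:
            x = blk(x, trace)
        for blk in decoder_blocks:
            x = blk(x, trace)
    return x, trace

y, trace = forward(0)
# 12 effective layer applications, but only 6 unique parameter sets
```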
weight tying
Tied encoder and decoder blocks are reused across loops with per-loop conditioning so repeated passes can behave differently.
parameters: {"num_loops":2}
Regularization
layerwise LN scale
parameters: {"per_loop_scale_bias":true}
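The per-loop scale/bias conditioning can be sketched as follows; the tied block and the scale/bias values are illustrative stand-ins, not the PR's learned parameters:

```python
NUM_LOOPS = 2

# One (scale, bias) pair per loop; in training these would be learned.
loop_scale = [1.0, 0.9]   # illustrative values only
loop_bias  = [0.0, 0.1]

def tied_block(x):
    return 2.0 * x        # stand-in for a shared transformer block

def forward(x):
    for loop in range(NUM_LOOPS):
        h = tied_block(x)                             # same tied weights each pass
        x = loop_scale[loop] * h + loop_bias[loop]    # per-loop LN-style scale/bias
    return x

out = forward(1.0)
```

Because the scale and bias are indexed by loop rather than by layer, the repeated pass through the tied block can transform activations differently on its second application.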
LR Schedule
learning rate scaling
parameters: {"scale":"1/sqrt(num_loops)"}
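A sketch of the learning-rate scaling rule: since each tied parameter receives gradient contributions from `num_loops` forward passes, the base LR is shrunk by `1/sqrt(num_loops)` (function name and base LR below are assumptions for illustration):

```python
import math

def scaled_lr(base_lr, num_loops):
    # Shrink the LR so the effective update magnitude for tied weights,
    # which accumulate gradients across all loops, stays comparable to
    # the untied baseline.
    return base_lr / math.sqrt(num_loops)

lr = scaled_lr(3e-4, 2)   # with num_loops=2, LR is divided by sqrt(2)
```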
Weight Averaging
SWA
parameters: {"gpu_resident":true,"dtype":"float32"}
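A sketch of the SWA running-average update. In the PR the accumulator is a float32 tensor kept on the GPU and updated in place, so no device-to-host copy is needed per checkpoint; here plain Python floats stand in for tensors to show only the update rule:

```python
class SWAAccumulator:
    """Running average of model weights (hypothetical minimal version)."""

    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, weights):
        # avg_{n+1} = avg_n + (w - avg_n) / (n + 1); in the real version
        # this runs in place on a float32 GPU-resident tensor.
        if self.avg is None:
            self.avg = list(weights)
        else:
            for i, w in enumerate(weights):
                self.avg[i] += (w - self.avg[i]) / (self.count + 1)
        self.count += 1

swa = SWAAccumulator()
for step_weights in ([1.0, 2.0], [3.0, 4.0]):
    swa.update(step_weights)
# swa.avg now holds the element-wise mean of the two snapshots
```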
Other
other
Async data prefetch using a background thread and separate CUDA stream to overlap data loading with compute.
parameters: null
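The background-thread prefetch can be sketched with a bounded queue; this is an assumed minimal version over a plain batch iterator. The PR additionally overlaps the host-to-device copy on a separate CUDA stream, which is CUDA-specific and indicated only as a comment here:

```python
import queue
import threading

def prefetcher(batch_iter, max_prefetch=2):
    """Yield batches loaded ahead of time by a background thread."""
    q = queue.Queue(maxsize=max_prefetch)   # bounded: limits memory use
    END = object()

    def worker():
        for batch in batch_iter:
            # Real version: stage the batch in pinned memory, then launch
            # an async H2D copy on a dedicated CUDA stream before enqueueing.
            q.put(batch)
        q.put(END)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is END:
            return
        yield batch

batches = list(prefetcher(iter(range(5))))
```

The training loop consumes from the queue while the worker thread loads the next batches, so data loading overlaps with compute.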
other
Pinned memory for faster host-to-device transfers.
parameters: null
other
NCCL tuning for H100 NVLink topology.
parameters: {"NCCL_NVLS_ENABLE":1,"NCCL_NET_GDR_LEVEL":5}
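As a config fragment, the NCCL settings above correspond to the following environment variables (effects summarized to the best of my understanding of NCCL's documented behavior):

```shell
# NCCL tuning for an H100 NVLink topology.
# NCCL_NVLS_ENABLE=1 enables NVLink SHARP (NVLS) collectives.
# NCCL_NET_GDR_LEVEL=5 raises the topology distance up to which
# GPUDirect RDMA is used between the NIC and GPU.
export NCCL_NVLS_ENABLE=1
export NCCL_NET_GDR_LEVEL=5
```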
other
GPU-resident SWA accumulation to avoid device-to-host synchronization per checkpoint.
parameters: null
other
Calls torch.cuda.empty_cache() after warmup to reduce memory fragmentation.
parameters: null

Novel Contributions

  • Depth recurrence with 6 unique layers looped twice to produce 12 effective layers
  • Per-loop conditioning via learned scale and bias for tied blocks
  • U-Net-style skip connections preserved across effective layers with LIFO behavior
  • Learning-rate scaling by 1/sqrt(num_loops) for tied-weight recurrence
  • Async data prefetch and pinned-memory training pipeline optimizations
  • NCCL tuning for H100 NVLink
  • GPU-resident SWA accumulation
  • Cache cleanup after warmup to reduce fragmentation