PR #648

closed

Depth Recurrence (3+3 x 2 loops) + HW Optimizations

by maorinka · View on GitHub
val_bpb
1.1428
Architecture
Transformer
Optimizer
Artifact Size
~10MB

Training Techniques

Architecture
depth recurrence
Replaces 10 unique transformer layers with 6 unique layers (3 encoder + 3 decoder) looped twice to create 12 effective layers while sharing parameters.
parameters: {"unique_layers":6,"encoder_layers":3,"decoder_layers":3,"num_loops":2,"effective_layers":12}
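A minimal sketch of the 3+3 x 2 recurrence schedule described above, with plain functions standing in for the PR's transformer blocks (block internals here are hypothetical placeholders):

```python
NUM_LOOPS = 2

def make_block(name):
    def block(x, trace):
        trace.append(name)   # record each effective layer application
        return x + 1         # placeholder for a real transformer block
    return block

# 6 unique blocks: 3 "encoder" + 3 "decoder"
encoder_blocks = [make_block(f"enc{i}") for i in range(3)]
decoder_blocks = [make_block(f"dec{i}") for i in range(3)]

def forward(x):
    trace = []
    for loop in range(NUM_LOOPS):      # same parameters reused each loop
        for blk in encoder_blocks:
            x = blk(x, trace)
        for blk in decoder_blocks:
            x = blk(x, trace)
    return x, trace

y, trace = forward(0)
# 12 effective layer applications, but only 6 unique parameter sets
```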
weight tying
Tied encoder and decoder blocks are reused across loops with per-loop conditioning so repeated passes can behave differently.
parameters: {"num_loops":2}
Regularization
layerwise LN scale
parameters: {"per_loop_scale_bias":true}
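The per-loop scale/bias conditioning can be sketched as follows; the tied block and the scale/bias values are illustrative stand-ins, not the PR's learned parameters:

```python
NUM_LOOPS = 2

# One (scale, bias) pair per loop; in training these would be learned.
loop_scale = [1.0, 0.9]   # illustrative values only
loop_bias  = [0.0, 0.1]

def tied_block(x):
    return 2.0 * x        # stand-in for a shared transformer block

def forward(x):
    for loop in range(NUM_LOOPS):
        h = tied_block(x)                             # same tied weights each pass
        x = loop_scale[loop] * h + loop_bias[loop]    # per-loop LN-style scale/bias
    return x

out = forward(1.0)
```

Because the scale and bias are indexed by loop rather than by layer, the repeated pass through the tied block can transform activations differently on its second application.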
LR Schedule
learning rate scaling
parameters: {"scale":"1/sqrt(num_loops)"}
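A sketch of the learning-rate scaling rule: since each tied parameter receives gradient contributions from `num_loops` forward passes, the base LR is shrunk by `1/sqrt(num_loops)` (function name and base LR below are assumptions for illustration):

```python
import math

def scaled_lr(base_lr, num_loops):
    # Shrink the LR so the effective update magnitude for tied weights,
    # which accumulate gradients across all loops, stays comparable to
    # the untied baseline.
    return base_lr / math.sqrt(num_loops)

lr = scaled_lr(3e-4, 2)   # with num_loops=2, LR is divided by sqrt(2)
```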
Weight Averaging
SWA
parameters: {"gpu_resident":true,"dtype":"float32"}
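A sketch of the SWA running-average update. In the PR the accumulator is a float32 tensor kept on the GPU and updated in place, so no device-to-host copy is needed per checkpoint; here plain Python floats stand in for tensors to show only the update rule:

```python
class SWAAccumulator:
    """Running average of model weights (hypothetical minimal version)."""

    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, weights):
        # avg_{n+1} = avg_n + (w - avg_n) / (n + 1); in the real version
        # this runs in place on a float32 GPU-resident tensor.
        if self.avg is None:
            self.avg = list(weights)
        else:
            for i, w in enumerate(weights):
                self.avg[i] += (w - self.avg[i]) / (self.count + 1)
        self.count += 1

swa = SWAAccumulator()
for step_weights in ([1.0, 2.0], [3.0, 4.0]):
    swa.update(step_weights)
# swa.avg now holds the element-wise mean of the two snapshots
```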
Other
other
Async data prefetch using a background thread and separate CUDA stream to overlap data loading with compute.
parameters: null
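The background-thread prefetch can be sketched with a bounded queue; this is an assumed minimal version over a plain batch iterator. The PR additionally overlaps the host-to-device copy on a separate CUDA stream, which is CUDA-specific and indicated only as a comment here:

```python
import queue
import threading

def prefetcher(batch_iter, max_prefetch=2):
    """Yield batches loaded ahead of time by a background thread."""
    q = queue.Queue(maxsize=max_prefetch)   # bounded: limits memory use
    END = object()

    def worker():
        for batch in batch_iter:
            # Real version: stage the batch in pinned memory, then launch
            # an async H2D copy on a dedicated CUDA stream before enqueueing.
            q.put(batch)
        q.put(END)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is END:
            return
        yield batch

batches = list(prefetcher(iter(range(5))))
```

The training loop consumes from the queue while the worker thread loads the next batches, so data loading overlaps with compute.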
other
Pinned memory for faster host-to-device transfers.
parameters: null
other
NCCL tuning for H100 NVLink topology.
parameters: {"NCCL_NVLS_ENABLE":1,"NCCL_NET_GDR_LEVEL":5}
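As a config fragment, the NCCL settings above correspond to the following environment variables (effects summarized to the best of my understanding of NCCL's documented behavior):

```shell
# NCCL tuning for an H100 NVLink topology.
# NCCL_NVLS_ENABLE=1 enables NVLink SHARP (NVLS) collectives.
# NCCL_NET_GDR_LEVEL=5 raises the topology distance up to which
# GPUDirect RDMA is used between the NIC and GPU.
export NCCL_NVLS_ENABLE=1
export NCCL_NET_GDR_LEVEL=5
```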
other
GPU-resident SWA accumulation to avoid device-to-host synchronization per checkpoint.
parameters: null
other
Calls torch.cuda.empty_cache() after warmup to reduce memory fragmentation.
parameters: null

Novel Contributions

  • Depth recurrence with 6 unique layers looped twice to produce 12 effective layers
  • Per-loop conditioning via learned scale and bias for tied blocks
  • U-Net-style skip connections preserved across effective layers with LIFO behavior
  • Learning-rate scaling by 1/sqrt(num_loops) for tied-weight recurrence
  • Async data prefetch and pinned-memory training pipeline optimizations
  • NCCL tuning for H100 NVLink
  • GPU-resident SWA accumulation
  • Cache cleanup after warmup to reduce fragmentation