val_bpb: 1.1428
Architecture: Transformer
Optimizer: —
Artifact Size: ~10MB
Training Techniques
Architecture
depth recurrence
Replaces 10 unique transformer layers with 6 unique layers (3 encoder + 3 decoder), looped twice to create 12 effective layers while sharing parameters across loops.
parameters: {"unique_layers":6,"encoder_layers":3,"decoder_layers":3,"num_loops":2,"effective_layers":12}
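The looped structure can be sketched as follows; this is a minimal illustration of parameter sharing across loops, with a trivial `x + 1` body standing in for a real transformer block (all names are illustrative, not from the run):

```python
# Depth recurrence sketch: 6 unique layers (3 encoder + 3 decoder) applied
# twice to yield 12 effective layers. The same layer objects (and hence the
# same parameters) are reused on every loop.

def make_layer():
    def layer(x):
        return x + 1  # stand-in for a transformer block
    return layer

encoder_layers = [make_layer() for _ in range(3)]
decoder_layers = [make_layer() for _ in range(3)]

def forward(x, num_loops=2):
    for _ in range(num_loops):                      # same parameters each loop
        for layer in encoder_layers + decoder_layers:  # 6 unique layers
            x = layer(x)
    return x

out = forward(0)  # 6 layers x 2 loops = 12 effective applications
```

In a real model the per-loop conditioning described below would make the second pass behave differently despite the shared weights.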
weight tying
Tied encoder and decoder blocks are reused across loops with per-loop conditioning so repeated passes can behave differently.
parameters: {"num_loops":2}
Regularization
layerwise LN scale
Learned per-loop scale and bias applied to layer normalization so tied blocks can behave differently on each pass through the loop.
parameters: {"per_loop_scale_bias":true}
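A minimal sketch of per-loop conditioning, with scalar scale/bias values standing in for the learned per-loop parameters (the numbers are hypothetical, chosen only to show that the two passes differ):

```python
# Per-loop conditioning sketch: one learned scale and bias per loop lets the
# same tied block act differently on each pass through the recurrence.

loop_scale = [1.0, 0.5]   # hypothetical learned per-loop scales
loop_bias = [0.0, 0.1]    # hypothetical learned per-loop biases

def tied_block(x):
    return 2 * x          # weights shared across both loops

def forward(x, num_loops=2):
    for loop in range(num_loops):
        x = loop_scale[loop] * tied_block(x) + loop_bias[loop]
    return x

forward(1.0)  # loop 0: 1.0*2 + 0.0 = 2.0; loop 1: 0.5*4.0 + 0.1 = 2.1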
LR Schedule
learning rate scaling
Scales the learning rate by 1/sqrt(num_loops) to account for tied weights receiving gradient contributions from every loop.
parameters: {"scale":"1/sqrt(num_loops)"}
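Under the stated 1/sqrt(num_loops) rule, the adjustment is one line; `base_lr` here is a hypothetical value, not taken from the run:

```python
import math

base_lr = 3e-4                       # hypothetical base learning rate
num_loops = 2                        # from the recurrence configuration
lr = base_lr / math.sqrt(num_loops)  # tied weights accumulate gradients from every loop
```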
Weight Averaging
SWA
Stochastic weight averaging with float32 accumulation kept resident on the GPU.
parameters: {"gpu_resident":true,"dtype":"float32"}
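The accumulation itself is a running mean of parameter snapshots; this sketch uses plain Python floats in place of on-device float32 tensors, but the in-place update pattern is the same one that avoids a device-to-host copy per checkpoint:

```python
# SWA accumulation sketch: a running average of parameter snapshots,
# updated in place. In the real pipeline the buffers stay on the GPU
# in float32, so no per-checkpoint device-to-host transfer is needed.

class SWAAccumulator:
    def __init__(self):
        self.avg = None  # running-mean buffer, created on first update
        self.n = 0       # number of snapshots averaged so far

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = list(params)  # copy the first snapshot
        else:
            for i, p in enumerate(params):
                self.avg[i] += (p - self.avg[i]) / self.n  # incremental mean

swa = SWAAccumulator()
for snapshot in ([1.0, 2.0], [3.0, 4.0]):
    swa.update(snapshot)
# swa.avg is now the elementwise mean of the two snapshots
```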
Other
other
Async data prefetch using a background thread and separate CUDA stream to overlap data loading with compute.
parameters: null
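The overlap pattern can be sketched with a bounded queue and a background thread; the CUDA-stream and pinned-memory steps are replaced by comments here so the sketch runs anywhere (names are illustrative):

```python
# Async prefetch sketch: a background thread stages upcoming batches while
# the main loop consumes them, so loading overlaps with compute.

import queue
import threading

def prefetcher(batches, q):
    for b in batches:
        q.put(b)      # real pipeline: copy into pinned memory, then issue an
    q.put(None)       # async host-to-device copy on a separate CUDA stream

def train(batches):
    q = queue.Queue(maxsize=2)  # bounded so prefetch stays a step or two ahead
    t = threading.Thread(target=prefetcher, args=(batches, q), daemon=True)
    t.start()
    seen = []
    while (b := q.get()) is not None:
        seen.append(b)          # "compute" overlaps with the prefetch thread
    t.join()
    return seen

train([1, 2, 3])
```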
other
Pinned memory for faster host-to-device transfers.
parameters: null
other
NCCL tuning for H100 NVLink topology.
parameters: {"NCCL_NVLS_ENABLE":1,"NCCL_NET_GDR_LEVEL":5}
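These settings would typically be exported in the launch environment before starting the distributed job; whether they help depends on the interconnect topology, and the comments below are assumptions about intent:

```shell
# NCCL settings as recorded in the parameters above.
export NCCL_NVLS_ENABLE=1      # enable NVLink SHARP (NVLS) collectives
export NCCL_NET_GDR_LEVEL=5    # assumption: widen the allowed GPU/NIC distance for GPUDirect RDMA
```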
other
GPU-resident SWA accumulation to avoid device-to-host synchronization per checkpoint.
parameters: null
other
Calls torch.cuda.empty_cache() after warmup to reduce memory fragmentation.
parameters: null
Novel Contributions
- Depth recurrence with 6 unique layers looped twice to produce 12 effective layers
- Per-loop conditioning via learned scale and bias for tied blocks
- U-Net-style skip connections preserved across effective layers with LIFO behavior
- Learning-rate scaling by 1/sqrt(num_loops) for tied-weight recurrence
- Async data prefetch and pinned-memory training pipeline optimizations
- NCCL tuning for H100 NVLink
- GPU-resident SWA accumulation
- Cache cleanup after warmup to reduce fragmentation
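The LIFO skip behavior in the third bullet can be sketched with a plain stack; the `x + 1` bodies stand in for real blocks, and the exact encoder/decoder pairing is an assumption for illustration:

```python
# U-Net-style skips with LIFO behavior across the looped stack: each encoder
# layer pushes its activation, and each decoder layer pops the most recent
# one and adds it back in.

def forward(x, num_loops=2):
    skips = []
    for _ in range(num_loops):
        for _ in range(3):               # encoder half: push activations
            x = x + 1
            skips.append(x)
        for _ in range(3):               # decoder half: pop in LIFO order
            x = x + 1 + skips.pop()
    return x
```

Using a stack (rather than fixed pairings) is what gives the last-in, first-out matching between encoder and decoder activations across the 12 effective layers.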