PR #1726 (open)
Non-record: Depth Recurrence Sweep — Systematic Layer Loop Ablation
by krishs0404
val_bpb: 1.4689
Architecture: Transformer
Optimizer: —
Artifact Size: 16,005,909 bytes
Training Techniques

Architecture: depth recurrence
Systematic ablation of looped layer ranges and loop activation timing in a depth-recurrence training stack.
parameters: {"loop_start": 3, "loop_end": 5, "enable_looping_at": 0.35, "num_loops": 2}
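A minimal sketch of how these parameters could drive a depth-recurrent forward pass. The function and layer representation are hypothetical; the inclusive `[loop_start, loop_end]` span and the reading of `enable_looping_at` as a training-progress fraction are assumptions, not taken from the PR's code.

```python
def forward_with_depth_recurrence(x, layers, params, training_progress):
    """Apply `layers` in order, repeating the looped span once looping is enabled.

    params: e.g. {"loop_start": 3, "loop_end": 5,
                  "enable_looping_at": 0.35, "num_loops": 2}
    training_progress: fraction of the training run completed, in [0, 1].
    """
    looping_on = training_progress >= params["enable_looping_at"]
    i = 0
    while i < len(layers):
        if looping_on and i == params["loop_start"]:
            # Reuse the [loop_start, loop_end] span num_loops times in a row.
            span = layers[params["loop_start"] : params["loop_end"] + 1]
            for _ in range(params["num_loops"]):
                for layer in span:
                    x = layer(x)
            i = params["loop_end"] + 1
        else:
            x = layers[i](x)
            i += 1
    return x
```

With 8 layers and the swept configuration, layers 3-5 run twice once looping is active, so a single forward pass applies 11 layer calls instead of 8.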
U-Net skip connections
Looped encoder/decoder index construction uses U-Net-style skip connections.
parameters: null
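One plausible reading of "looped encoder/decoder index construction" is that skip pairs are built over the unrolled layer sequence, mirroring each position with its counterpart from the other end. Both helpers below are illustrative guesses, not the PR's actual construction.

```python
def unrolled_layer_indices(num_layers, loop_start, loop_end, num_loops):
    """Effective layer index sequence after unrolling the looped span."""
    pre = list(range(loop_start))
    span = list(range(loop_start, loop_end + 1)) * num_loops
    post = list(range(loop_end + 1, num_layers))
    return pre + span + post

def build_skip_pairs(indices):
    """U-Net-style pairing: position i feeds its mirror position len-1-i."""
    n = len(indices)
    return [(i, n - 1 - i) for i in range(n // 2)]
```

For 8 layers with layers 3-5 looped twice, the unrolled sequence is [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7] and the first skip pair connects position 0 to position 10.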
Quantization

GPTQ (bits: 6, scope: block weights)
GPTQ (bits: 8, scope: embeddings)
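The mixed-precision setup above could be represented as a per-scope rule table; the dict layout and the name-matching heuristic here are illustrative, not the PR's configuration format.

```python
# Hypothetical encoding of the two GPTQ rules listed in the record.
QUANT_CONFIG = [
    {"method": "GPTQ", "bits": 6, "scope": "block weights"},
    {"method": "GPTQ", "bits": 8, "scope": "embeddings"},
]

def bits_for(param_name, config=QUANT_CONFIG):
    """Pick a bit width for a parameter by scope (embeddings vs. block weights)."""
    scope = "embeddings" if "embed" in param_name else "block weights"
    for rule in config:
        if rule["scope"] == scope:
            return rule["bits"]
    return None
```

Keeping embeddings at a higher bit width than block weights is a common choice, since embedding tables tend to be more sensitive to quantization error.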
Evaluation
sliding window eval
parameters: null
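Sliding-window evaluation typically scores a long sequence in overlapping windows so every scored token gets extra left context; since no parameters are recorded, the window/stride scheme below is a generic sketch rather than the PR's exact evaluation.

```python
def sliding_window_targets(num_tokens, window, stride):
    """Yield (window_start, score_start, score_end) spans for sliding-window eval.

    Each window covers tokens [window_start, window_start + window); only the
    last `stride` positions (score_start..score_end) are scored, so scored
    tokens see up to `window - stride` tokens of additional left context.
    """
    spans = []
    score_start = 0
    while score_start < num_tokens:
        score_end = min(score_start + stride, num_tokens)
        window_start = max(0, score_end - window)
        spans.append((window_start, score_start, score_end))
        score_start = score_end
    return spans
```

Every token is scored exactly once, so summing per-token losses over the scored spans gives a well-defined bits-per-byte figure.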
Sequence Length
train_length: null
eval_length: null
Novel Contributions
- Systematic ablation of depth recurrence loop range across early, middle, and late layers
- Comparison of loop activation timing for depth recurrence
- Finding that the baseline middle-layer loop configuration (layers 3-5) is best among tested variants
- Observation that minimal reuse of layers 5-6 is nearly competitive with the baseline
- Demonstration that heavy layer reuse significantly hurts performance, since the extra layer passes reduce token throughput under a fixed wall-clock budget
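The fixed wall-clock-budget point in the last bullet can be made concrete with a back-of-envelope throughput model. The layer count and the uniform per-layer cost are assumptions for illustration only.

```python
def relative_throughput(num_layers, loop_start, loop_end, num_loops):
    """Tokens/sec relative to no looping, assuming compute scales linearly
    with effective depth and per-layer cost is uniform (an idealization)."""
    span = loop_end - loop_start + 1
    effective_depth = num_layers + span * (num_loops - 1)
    return num_layers / effective_depth
```

Under this model, a hypothetical 8-layer stack looping layers 3-5 twice runs at 8/11 of baseline throughput, while looping the same span four times drops it below half, which is the mechanism behind "heavy reuse hurts under a fixed wall-clock budget": fewer tokens are seen in the same training time.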