PR #1726 (open)
Non-record: Depth Recurrence Sweep — Systematic Layer Loop Ablation
by krishs0404
val_bpb: 1.4689
Architecture: Transformer
Optimizer: —
Artifact Size: 16,005,909 bytes
Training Techniques

Architecture: depth recurrence
Systematic ablation of looped layer ranges and loop activation timing in a depth-recurrence training stack.
parameters: {"loop_start": 3, "loop_end": 5, "enable_looping_at": 0.35, "num_loops": 2}
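A minimal sketch of how these parameters could drive a depth-recurrent forward pass. The function and layer representation are hypothetical; the inclusive `[loop_start, loop_end]` span and the reading of `enable_looping_at` as a training-progress fraction are assumptions, not taken from the PR's code.

```python
def forward_with_depth_recurrence(x, layers, params, training_progress):
    """Apply `layers` in order, repeating the looped span once looping is enabled.

    params: e.g. {"loop_start": 3, "loop_end": 5,
                  "enable_looping_at": 0.35, "num_loops": 2}
    training_progress: fraction of the training run completed, in [0, 1].
    """
    looping_on = training_progress >= params["enable_looping_at"]
    i = 0
    while i < len(layers):
        if looping_on and i == params["loop_start"]:
            # Reuse the [loop_start, loop_end] span num_loops times in a row.
            span = layers[params["loop_start"] : params["loop_end"] + 1]
            for _ in range(params["num_loops"]):
                for layer in span:
                    x = layer(x)
            i = params["loop_end"] + 1
        else:
            x = layers[i](x)
            i += 1
    return x
```

With 8 layers and the swept configuration, layers 3-5 run twice once looping is active, so a single forward pass applies 11 layer calls instead of 8.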
U-Net skip connections
Looped encoder/decoder index construction uses U-Net-style skip connections.
parameters: null
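One plausible reading of "looped encoder/decoder index construction" is that skip pairs are built over the unrolled layer sequence, mirroring each position with its counterpart from the other end. Both helpers below are illustrative guesses, not the PR's actual construction.

```python
def unrolled_layer_indices(num_layers, loop_start, loop_end, num_loops):
    """Effective layer index sequence after unrolling the looped span."""
    pre = list(range(loop_start))
    span = list(range(loop_start, loop_end + 1)) * num_loops
    post = list(range(loop_end + 1, num_layers))
    return pre + span + post

def build_skip_pairs(indices):
    """U-Net-style pairing: position i feeds its mirror position len-1-i."""
    n = len(indices)
    return [(i, n - 1 - i) for i in range(n // 2)]
```

For 8 layers with layers 3-5 looped twice, the unrolled sequence is [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7] and the first skip pair connects position 0 to position 10.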
Quantization

GPTQ (bits: 6, scope: block weights)
GPTQ (bits: 8, scope: embeddings)
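The mixed-precision setup above could be represented as a per-scope rule table; the dict layout and the name-matching heuristic here are illustrative, not the PR's configuration format.

```python
# Hypothetical encoding of the two GPTQ rules listed in the record.
QUANT_CONFIG = [
    {"method": "GPTQ", "bits": 6, "scope": "block weights"},
    {"method": "GPTQ", "bits": 8, "scope": "embeddings"},
]

def bits_for(param_name, config=QUANT_CONFIG):
    """Pick a bit width for a parameter by scope (embeddings vs. block weights)."""
    scope = "embeddings" if "embed" in param_name else "block weights"
    for rule in config:
        if rule["scope"] == scope:
            return rule["bits"]
    return None
```

Keeping embeddings at a higher bit width than block weights is a common choice, since embedding tables tend to be more sensitive to quantization error.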
Evaluation
sliding window eval
parameters: null
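Sliding-window evaluation typically scores a long sequence in overlapping windows so every scored token gets extra left context; since no parameters are recorded, the window/stride scheme below is a generic sketch rather than the PR's exact evaluation.

```python
def sliding_window_targets(num_tokens, window, stride):
    """Yield (window_start, score_start, score_end) spans for sliding-window eval.

    Each window covers tokens [window_start, window_start + window); only the
    last `stride` positions (score_start..score_end) are scored, so scored
    tokens see up to `window - stride` tokens of additional left context.
    """
    spans = []
    score_start = 0
    while score_start < num_tokens:
        score_end = min(score_start + stride, num_tokens)
        window_start = max(0, score_end - window)
        spans.append((window_start, score_start, score_end))
        score_start = score_end
    return spans
```

Every token is scored exactly once, so summing per-token losses over the scored spans gives a well-defined bits-per-byte figure.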
Sequence Length
train_length: null
eval_length: null
Novel Contributions
- Systematic ablation of depth recurrence loop range across early, middle, and late layers
- Comparison of loop activation timing for depth recurrence
- Finding that the baseline middle-layer loop configuration (layers 3-5) is best among tested variants
- Observation that minimal reuse of layers 5-6 is nearly competitive with the baseline
- Demonstration that heavy layer reuse significantly hurts performance, since the extra layer passes reduce token throughput under a fixed wall-clock budget
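The fixed wall-clock-budget point in the last bullet can be made concrete with a back-of-envelope throughput model. The layer count and the uniform per-layer cost are assumptions for illustration only.

```python
def relative_throughput(num_layers, loop_start, loop_end, num_loops):
    """Tokens/sec relative to no looping, assuming compute scales linearly
    with effective depth and per-layer cost is uniform (an idealization)."""
    span = loop_end - loop_start + 1
    effective_depth = num_layers + span * (num_loops - 1)
    return num_layers / effective_depth
```

Under this model, a hypothetical 8-layer stack looping layers 3-5 twice runs at 8/11 of baseline throughput, while looping the same span four times drops it below half, which is the mechanism behind "heavy reuse hurts under a fixed wall-clock budget": fewer tokens are seen in the same training time.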