PR #1449

open

Non-Record: Full-Model Depth Recurrence Ablation — 7 configs, torch.compile penalty = 0

by codeprakhar25
val_bpb
1.3680
Architecture
Transformer
Optimizer
Muon
Artifact Size
10.0 MB

Training Techniques

Architecture
depth recurrence
Cycles the full stack of unique transformer blocks through multiple passes, giving an effective depth of N×R (N unique blocks × R repeats).
parameters: {"unique_blocks":7,"repeats":2}
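The depth-recurrence scheme above can be sketched as follows; the class name, block type, and dimensions are illustrative assumptions, not taken from the PR's code:

```python
import torch
import torch.nn as nn

class DepthRecurrent(nn.Module):
    """Minimal sketch of full-model depth recurrence: N unique blocks
    applied R times, weights shared across cycles -> effective depth N*R."""
    def __init__(self, unique_blocks=7, repeats=2, dim=64):
        super().__init__()
        # Stand-in blocks; the real model uses full transformer blocks.
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(unique_blocks))
        self.repeats = repeats

    def forward(self, x):
        for _ in range(self.repeats):      # cycle the whole stack R times
            for block in self.blocks:      # N unique blocks, reused each cycle
                x = torch.relu(block(x))
        return x
```

Note that parameter count scales only with N, while compute scales with N×R, which is what makes the 7×2 configuration attractive for artifact size.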
U-Net skip connections
Skip connections are reindexed by effective layer position across recurrence boundaries.
parameters: null
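One plausible reading of this reindexing, sketched with a hypothetical helper: skip connections are paired by effective layer position (cycle × N + block index) rather than by unique-block index, assuming the usual first-half/second-half U-Net pairing:

```python
def skip_pairs(unique_blocks, repeats):
    """Hypothetical sketch: pair encoder/decoder layers by effective
    position across the unrolled depth, so a skip can connect a layer in
    cycle 0 to its mirror in cycle R-1 (crossing recurrence boundaries)."""
    depth = unique_blocks * repeats          # effective depth N*R
    half = depth // 2
    return [(i, depth - 1 - i) for i in range(half)]
```

With 7 unique blocks and 2 repeats, every pair spans the boundary between the first and second cycle.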
BigramHash
Adds bigram hash embeddings alongside recurrent depth configuration.
parameters: null
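A hedged sketch of what a bigram hash embedding typically looks like: each (previous, current) token pair is hashed into a fixed-size table and the result is added to the unigram embedding. The hash constant and table size here are illustrative assumptions, not the PR's values:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Sketch: unigram embedding plus a hashed bigram embedding."""
    def __init__(self, vocab_size=256, table_size=4096, dim=32):
        super().__init__()
        self.uni = nn.Embedding(vocab_size, dim)
        self.bi = nn.Embedding(table_size, dim)
        self.table_size = table_size

    def forward(self, ids):                   # ids: (batch, seq)
        prev = torch.roll(ids, 1, dims=1)
        prev[:, 0] = 0                        # first token has no predecessor
        h = (prev * 1000003 + ids) % self.table_size  # cheap multiplicative hash
        return self.uni(ids) + self.bi(h)
```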
ReLU²
Uses squared ReLU activation in the baseline architecture.
parameters: null
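Squared ReLU is simply the ReLU output squared; a one-liner in PyTorch:

```python
import torch

def relu_squared(x):
    """Squared ReLU activation: relu(x) ** 2."""
    return torch.relu(x) ** 2
```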
GQA
Uses grouped query attention in the baseline architecture.
parameters: null
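Grouped query attention shares each key/value head across a group of query heads, shrinking the KV cache. A minimal sketch (head counts and shapes are illustrative, not the PR's configuration):

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v):
    """Sketch of grouped query attention.
    q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq a multiple of Hkv.
    Each KV head is repeated across its group of query heads."""
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)
```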
RoPE
Uses rotary positional embeddings in the baseline architecture.
parameters: null
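RoPE rotates each pair of channels by a position-dependent angle; a self-contained sketch, assuming the common base of 10000 (the PR's exact settings are not stated):

```python
import torch

def rope(x, base=10000.0):
    """Sketch of rotary positional embeddings.
    x: (..., T, D) with D even; channel pairs (2i, 2i+1) are rotated
    by angle pos * base**(-2i/D)."""
    T, D = x.shape[-2], x.shape[-1]
    pos = torch.arange(T, dtype=torch.float32)
    freqs = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    angles = pos[:, None] * freqs[None, :]       # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

At position 0 every angle is zero, so the rotation is the identity.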
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Compression
int8
level: null
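The int8 entry's exact scheme isn't specified (level is null); a common choice is symmetric per-tensor quantization, sketched here as an assumption:

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (illustrative; the PR's
    actual compression scheme is not specified)."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale
```

Storing int8 weights plus one float scale per tensor is roughly a 4x size reduction versus float32, consistent with the small artifact size reported above.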

Novel Contributions

  • Systematic ablation of full-model depth recurrence across seven configurations
  • Demonstration that torch.compile incurs zero slowdown penalty under depth recurrence
  • Identification of 7×2 as the best size-performance tradeoff among tested configs
  • Observation that increasing repeats hurts performance while increasing unique blocks helps
  • Finding that BigramHash does not complement recurrence at low step counts
  • Finding that naive width scaling fails without hyperparameter retuning