PR #1449

open

Non-Record: Full-Model Depth Recurrence Ablation — 7 configs, torch.compile penalty = 0

by codeprakhar25
val_bpb
1.3680
Architecture
Transformer
Optimizer
Muon
Artifact Size
10.0 MB

Training Techniques

Architecture
depth recurrence
Cycles the full stack of unique transformer blocks through multiple passes, giving an effective depth of N×R (N unique blocks × R repeats).
parameters: {"unique_blocks":7,"repeats":2}
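The depth-recurrence scheme above can be sketched as follows; the class name, block type, and dimensions are illustrative assumptions, not taken from the PR's code:

```python
import torch
import torch.nn as nn

class DepthRecurrent(nn.Module):
    """Minimal sketch of full-model depth recurrence: N unique blocks
    applied R times, weights shared across cycles -> effective depth N*R."""
    def __init__(self, unique_blocks=7, repeats=2, dim=64):
        super().__init__()
        # Stand-in blocks; the real model uses full transformer blocks.
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(unique_blocks))
        self.repeats = repeats

    def forward(self, x):
        for _ in range(self.repeats):      # cycle the whole stack R times
            for block in self.blocks:      # N unique blocks, reused each cycle
                x = torch.relu(block(x))
        return x
```

Note that parameter count scales only with N, while compute scales with N×R, which is what makes the 7×2 configuration attractive for artifact size.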
U-Net skip connections
Skip connections are reindexed by effective layer position across recurrence boundaries.
parameters: null
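One plausible reading of this reindexing, sketched with a hypothetical helper: skip connections are paired by effective layer position (cycle × N + block index) rather than by unique-block index, assuming the usual first-half/second-half U-Net pairing:

```python
def skip_pairs(unique_blocks, repeats):
    """Hypothetical sketch: pair encoder/decoder layers by effective
    position across the unrolled depth, so a skip can connect a layer in
    cycle 0 to its mirror in cycle R-1 (crossing recurrence boundaries)."""
    depth = unique_blocks * repeats          # effective depth N*R
    half = depth // 2
    return [(i, depth - 1 - i) for i in range(half)]
```

With 7 unique blocks and 2 repeats, every pair spans the boundary between the first and second cycle.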
BigramHash
Adds bigram hash embeddings alongside recurrent depth configuration.
parameters: null
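A hedged sketch of what a bigram hash embedding typically looks like: each (previous, current) token pair is hashed into a fixed-size table and the result is added to the unigram embedding. The hash constant and table size here are illustrative assumptions, not the PR's values:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Sketch: unigram embedding plus a hashed bigram embedding."""
    def __init__(self, vocab_size=256, table_size=4096, dim=32):
        super().__init__()
        self.uni = nn.Embedding(vocab_size, dim)
        self.bi = nn.Embedding(table_size, dim)
        self.table_size = table_size

    def forward(self, ids):                   # ids: (batch, seq)
        prev = torch.roll(ids, 1, dims=1)
        prev[:, 0] = 0                        # first token has no predecessor
        h = (prev * 1000003 + ids) % self.table_size  # cheap multiplicative hash
        return self.uni(ids) + self.bi(h)
```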
ReLU²
Uses squared ReLU activation in the baseline architecture.
parameters: null
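Squared ReLU is simply the ReLU output squared; a one-liner in PyTorch:

```python
import torch

def relu_squared(x):
    """Squared ReLU activation: relu(x) ** 2."""
    return torch.relu(x) ** 2
```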
GQA
Uses grouped query attention in the baseline architecture.
parameters: null
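Grouped query attention shares each key/value head across a group of query heads, shrinking the KV cache. A minimal sketch (head counts and shapes are illustrative, not the PR's configuration):

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v):
    """Sketch of grouped query attention.
    q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq a multiple of Hkv.
    Each KV head is repeated across its group of query heads."""
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)
```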
RoPE
Uses rotary positional embeddings in the baseline architecture.
parameters: null
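RoPE rotates each pair of channels by a position-dependent angle; a self-contained sketch, assuming the common base of 10000 (the PR's exact settings are not stated):

```python
import torch

def rope(x, base=10000.0):
    """Sketch of rotary positional embeddings.
    x: (..., T, D) with D even; channel pairs (2i, 2i+1) are rotated
    by angle pos * base**(-2i/D)."""
    T, D = x.shape[-2], x.shape[-1]
    pos = torch.arange(T, dtype=torch.float32)
    freqs = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    angles = pos[:, None] * freqs[None, :]       # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

At position 0 every angle is zero, so the rotation is the identity.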
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Compression
int8
level: null
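The int8 entry's exact scheme isn't specified (level is null); a common choice is symmetric per-tensor quantization, sketched here as an assumption:

```python
import torch

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (illustrative; the PR's
    actual compression scheme is not specified)."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale
```

Storing int8 weights plus one float scale per tensor is roughly a 4x size reduction versus float32, consistent with the small artifact size reported above.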

Novel Contributions

  • Systematic ablation of full-model depth recurrence across seven configurations
  • Demonstration that torch.compile incurs zero slowdown penalty under depth recurrence
  • Identification of 7×2 as the best size-performance tradeoff among tested configs
  • Observation that increasing repeats hurts performance while increasing unique blocks helps
  • Finding that BigramHash does not complement recurrence at low step counts
  • Finding that naive width scaling fails without hyperparameter retuning