PR #1509
open
Non-record: DepthScale — Parameter-Shared Iterative Transformer (1.1962 BPB)
by Lumi-node
val_bpb: 1.1962
Architecture: Transformer
Optimizer: —
Artifact Size: 30MB
Training Techniques

Architecture: depth recurrence
Reuses the same 5 physical transformer layers across two iterations to create 10 effective layers of depth with shared weights.
parameters: {"layers": 5, "iterations": 2, "effective_depth": 10}
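A minimal sketch of the depth-recurrence pattern, assuming a standard PyTorch encoder block as the physical layer (the submission's actual block structure is not given): 5 physical layers applied for 2 iterations give 10 effective layers at the parameter cost of 5.

```python
import torch
import torch.nn as nn

class IterativeTransformer(nn.Module):
    """Reuse the same physical layers across several forward iterations.

    Illustrative stand-in, not the submission's code: the block type,
    widths, and head count here are assumptions.
    """

    def __init__(self, d_model=64, n_heads=4, n_layers=5, n_iterations=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=128, batch_first=True
            )
            for _ in range(n_layers)
        )
        self.n_iterations = n_iterations

    def forward(self, x):
        # 5 physical layers x 2 iterations = 10 effective layers;
        # the weights are shared, so iterating adds no parameters.
        for _ in range(self.n_iterations):
            for layer in self.layers:
                x = layer(x)
        return x
```

Since iterating reuses the same weights, a 1-iteration and a 2-iteration model have identical parameter counts; only compute and effective depth grow.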
RoPE
Iteration-aware RoPE shifts positional frequencies by an iteration-dependent offset so repeated passes can learn distinct attention patterns.
parameters: {"epsilon":0.1}
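The exact offset formula is not specified in the metadata, so the following NumPy sketch is one plausible reading: each RoPE frequency is shifted by epsilon * iteration (epsilon = 0.1 from the parameters above), so the two passes through the shared layers see different rotation geometry.

```python
import numpy as np

def rope_angles(seq_len, head_dim, iteration, epsilon=0.1, base=10000.0):
    # Standard RoPE inverse frequencies, one per (even, odd) channel pair.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    pos = np.arange(seq_len)[:, None]
    # Assumed form of "iteration-aware": shift every frequency by an
    # iteration-dependent offset before multiplying by position.
    return pos * (inv_freq + epsilon * iteration)

def apply_rope(x, iteration, epsilon=0.1):
    """Rotate channel pairs of x (shape seq_len x head_dim, head_dim even)."""
    seq_len, head_dim = x.shape
    ang = rope_angles(seq_len, head_dim, iteration, epsilon)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

At iteration 0 this reduces to standard RoPE; being a pure rotation, it preserves per-position norms while giving each iteration distinct attention phases.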
Quantization
int8 (bits: 8, scope: all)
STE QAT (bits: 4, scope: all)
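A minimal sketch of 4-bit straight-through-estimator (STE) fake quantization for QAT: the forward pass sees quantized weights, while the backward pass treats the rounding as identity so gradients still flow. The symmetric per-tensor scheme is an assumption; the metadata only states 4-bit STE over all weights.

```python
import torch

def ste_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Fake-quantize w: quantized values forward, identity gradient backward."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for symmetric int4
    # Per-tensor scale from the max magnitude (assumed calibration choice).
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward evaluates to q, but the
    # (q - w).detach() term carries no gradient, so d(out)/d(w) = 1.
    return w + (q - w).detach()
```

The output takes at most 2^4 = 16 distinct values, yet `w.grad` after a backward pass is the unmodified upstream gradient, which is what makes training through the quantizer possible.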
Compression: zlib (level: null)
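The artifact is stored zlib-compressed with the level left unspecified (null) in the metadata; a minimal round-trip with Python's zlib at its default level looks like this (the payload is a placeholder, not the actual 30MB artifact):

```python
import zlib

# Round-trip a placeholder payload at zlib's default compression level.
payload = b"model-weights-placeholder" * 1000
blob = zlib.compress(payload)
restored = zlib.decompress(blob)
assert restored == payload
assert len(blob) < len(payload)   # repetitive data compresses well
```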
Sequence Length
train_length: null
eval_length: null
Novel Contributions
- Parameter-shared iterative transformer that reuses 5 physical layers across multiple passes
- Iteration-aware RoPE to distinguish different iterations of the shared-depth model
- Demonstration of 10 effective layers of depth at constant parameter cost
- 3-seed reproducibility result of 1.1962 BPB on 8×H100 SXM
- Quantization-aware training with 4-bit STE for robustness to extreme quantization