Azure Data Factory

Using Azure Data Factory Data Flow to transform and migrate data from Cosmos DB Collections

Migration and Transformation of Azure Cosmos DB Collection

Reitter

--

Introduction

Azure Data Factory is a data integration service that allows you to create data pipelines to collect and transform your data without writing a single line of code.

It has more than 90 built-in connectors for data sources such as SQL Server, Salesforce, Azure Cosmos DB and Azure Blob Storage, and it is a fully managed, serverless cloud service, meaning that you only pay for what you use.

Azure Cosmos DB is a NoSQL database service from Microsoft whose core feature is performance. It is a globally distributed, multi-model database suited for highly responsive applications.

In this post we will migrate and transform all the data from an Azure Cosmos DB collection with Azure Data Factory.

Azure Data Factory Linked services

Before you create a dataset or a pipeline, you must create a linked service to link your data source to the data factory. The Microsoft documentation describes Linked services as:

Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources.

In order to create a Linked service, go to the “Manage” tab on the Azure Data Factory page and click “New” as shown below:

Creating a Linked Service

There you will select the Azure Cosmos DB (SQL API) connector and then choose your Cosmos DB account and database.
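If you prefer to script this step, the same linked service can be created with the azure-mgmt-datafactory Python SDK. The snippet below is a minimal sketch, assuming a recent (track 2) version of the SDK together with azure-identity, placeholder names (rg-adf-demo, my-data-factory, CosmosDbLinkedService) and an illustrative connection string; replace these with your own values.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    CosmosDbLinkedService,
    SecureString,
)

# Placeholder names -- replace with your own subscription, resource group and factory.
subscription_id = "<subscription-id>"
resource_group = "rg-adf-demo"
factory_name = "my-data-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# The connection string points the linked service at the Cosmos DB account and database.
cosmos_ls = LinkedServiceResource(
    properties=CosmosDbLinkedService(
        connection_string=SecureString(
            value="AccountEndpoint=https://<account>.documents.azure.com:443/;"
                  "AccountKey=<account-key>;Database=<database>"
        )
    )
)

adf_client.linked_services.create_or_update(
    resource_group, factory_name, "CosmosDbLinkedService", cosmos_ls
)
```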

Create Datasets

Datasets represent the source and the destination of our data within the linked service. In our case we will create one dataset for the source collection and another for the destination collection.

In order to create a dataset, go to the “Author” tab, then Datasets and finally “New dataset”:

Creating a dataset

There you will select the Azure Cosmos DB SQL API and then the linked service created previously. You will do this twice, creating one dataset for each collection.
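If you are scripting the setup, the two datasets can be created in a similar way, continuing from the previous snippet (reusing adf_client, resource_group and factory_name). The CosmosDbSqlApiCollectionDataset model and the collection names SourceCollection and DestinationCollection are assumptions for illustration; swap in your own collection names.

```python
from azure.mgmt.datafactory.models import (
    DatasetResource,
    CosmosDbSqlApiCollectionDataset,
    LinkedServiceReference,
)

# Reference to the linked service created earlier (placeholder name).
ls_ref = LinkedServiceReference(
    type="LinkedServiceReference", reference_name="CosmosDbLinkedService"
)

# One dataset per collection: the source we read from and the destination we write to.
for dataset_name, collection in [
    ("SourceDataset", "SourceCollection"),
    ("DestinationDataset", "DestinationCollection"),
]:
    adf_client.datasets.create_or_update(
        resource_group,
        factory_name,
        dataset_name,
        DatasetResource(
            properties=CosmosDbSqlApiCollectionDataset(
                linked_service_name=ls_ref,
                collection_name=collection,
            )
        ),
    )
```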

Creating a Mapping Data Flow

Mapping Data Flow allows you to copy and transform data visually without having to write a single line of code. You can create a Data Flow where you get data from multiple sources, join the data, transform and finally store it in a new database.

In our scenario we will get the data from a collection, add a new property for each document and then save the data on another collection.

In order to create a Mapping Data Flow, go to “Data flows” then select “New mapping dataflow”.

Creating a Data flow

You will then select “Data flow”:

Creating a data flow

Once the data flow is created, select “Add source” to add the previously created dataset that will be the source of the data.

You will now transform each document before persisting it to the other collection. To do that, click the small plus sign on the bottom right corner of the data source and then select “Derived Column”:

Adding data transformation step to our data flow

Now you can create a new property for each document. In this example the property is called “New Property” as seen below:

Transforming the data

To add a value for this property, select “Open expression builder”. You can create complex logic for the property’s value; in our case we will set it to the same value as another document property. To do so, simply type the name of one of your document’s existing properties.

Now you can add the destination dataset where the data will be persisted. To do that, click the small plus sign on the bottom right corner of the Derived Column step and add a Sink:

Adding the Sink dataset
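Behind the designer, the flow is stored as a data flow definition with a source, a derived column and a sink. The sketch below is a rough equivalent using the SDK’s mapping data flow models, continuing from the earlier snippets. The model names (MappingDataFlow, DataFlowSource, DataFlowSink, Transformation), the data flow script syntax and the id property used in the derived column are illustrative assumptions and may need adjusting to your SDK version and document schema.

```python
from azure.mgmt.datafactory.models import (
    DataFlowResource,
    MappingDataFlow,
    DataFlowSource,
    DataFlowSink,
    Transformation,
    DatasetReference,
)

# Illustrative data flow script: read from the source dataset, add a
# "NewProperty" column copied from an existing property (here assumed to be
# "id"), and write the result to the sink.
script = (
    "source(allowSchemaDrift: true, validateSchema: false) ~> CosmosSource\n"
    "CosmosSource derive(NewProperty = id) ~> AddNewProperty\n"
    "AddNewProperty sink(allowSchemaDrift: true, validateSchema: false) ~> CosmosSink"
)

data_flow = DataFlowResource(
    properties=MappingDataFlow(
        sources=[
            DataFlowSource(
                name="CosmosSource",
                dataset=DatasetReference(
                    type="DatasetReference", reference_name="SourceDataset"
                ),
            )
        ],
        sinks=[
            DataFlowSink(
                name="CosmosSink",
                dataset=DatasetReference(
                    type="DatasetReference", reference_name="DestinationDataset"
                ),
            )
        ],
        transformations=[Transformation(name="AddNewProperty")],
        script=script,
    )
)

adf_client.data_flows.create_or_update(
    resource_group, factory_name, "TransformCosmosCollection", data_flow
)
```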

Data Flow debug mode

Debug mode allows us to see the result of each step of the Data Flow without having to run the pipeline. This is very useful when creating complex data pipelines.

Debug mode allows you to interactively see the results of each transformation step while you build and debug your data flows. The debug session can be used both when building your data flow logic and when running pipeline debug runs with data flow activities.

To enable the debug mode, simply select the option below:

Enabling Data flow debug

For the Integration Runtime leave it as AutoResolveIntegrationRuntime.

After doing that, select “Publish all” to persist all the changes made to the data factory.

Creating a Pipeline

After creating your Data Flow, you must put it into a pipeline, just like anything else in Data Factory. To create a pipeline, select “New Pipeline”:

Creating a pipeline

Now you select “Move and Transform” and then drag and drop the Data Flow icon:

Creating a pipeline
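For completeness, the same pipeline can be sketched in code by wrapping the data flow in an Execute Data Flow activity, continuing from the earlier snippets. The ExecuteDataFlowActivity model and the pipeline name CosmosMigrationPipeline are assumptions for illustration.

```python
from azure.mgmt.datafactory.models import (
    PipelineResource,
    ExecuteDataFlowActivity,
    DataFlowReference,
)

# A pipeline with a single Data Flow activity pointing at the flow defined earlier.
pipeline = PipelineResource(
    activities=[
        ExecuteDataFlowActivity(
            name="RunTransformCosmosCollection",
            data_flow=DataFlowReference(
                type="DataFlowReference",
                reference_name="TransformCosmosCollection",
            ),
        )
    ]
)

adf_client.pipelines.create_or_update(
    resource_group, factory_name, "CosmosMigrationPipeline", pipeline
)
```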

Now everything is ready to run. There are two options for running the pipeline: you can debug it or you can execute it. The difference between the two is explained in a great article by Cathrine Wilhelmsen:

The difference between debugging and executing pipelines is that debugging does not log execution information, so you cannot see the results on the Monitor page. Instead, you can only see the results in the output pane in the pipeline.

As we want to see and keep the history of the executed pipelines, we will execute the pipeline. To do so, select “Add trigger” and then “Trigger now”:

Executing the pipeline
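“Trigger now” simply starts a one-off run of the pipeline. From code, the equivalent is create_run, which returns a run ID that identifies the run on the Monitor page (continuing with the pipeline name assumed above):

```python
# Kick off a one-off run of the pipeline; the returned run_id identifies it later.
run_response = adf_client.pipelines.create_run(
    resource_group, factory_name, "CosmosMigrationPipeline", parameters={}
)
print("Started pipeline run:", run_response.run_id)
```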

Note that the time it takes to run the transformation on a small dataset is not driven by the size of the dataset, but by the overhead of starting the pipeline and the data flow cluster.

To see the status of the pipeline, select “Monitor” on the left of the page; the running pipeline will be listed there with its details.
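The same status can also be polled programmatically with pipeline_runs.get, using the run ID returned by create_run in the earlier sketch:

```python
import time

# Poll the run until it finishes; status is typically Queued, InProgress,
# Succeeded or Failed.
while True:
    pipeline_run = adf_client.pipeline_runs.get(
        resource_group, factory_name, run_response.run_id
    )
    print("Pipeline run status:", pipeline_run.status)
    if pipeline_run.status not in ("Queued", "InProgress", "Canceling"):
        break
    time.sleep(30)
```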

Conclusion

In this article we have created an Azure Data Factory pipeline to move and transform data from an Azure Cosmos DB collection. If you would like to read more about it, the reference section below contains some great articles on this subject.
