Introduction
Azure Data Factory is a data integration service that allows you to create data pipelines where you can collect and transform your data without writing a single line of code.
It has more than 90 built-in connectors for data sources such as SQL Server, Salesforce, Azure Cosmos DB, and Azure Blob Storage, and it is a fully managed, serverless cloud service, meaning that you only pay for what you use.
Azure Cosmos DB is a NoSQL database provided by Microsoft whose core feature is performance. It is a globally distributed, multi-model database suited for highly responsive applications.
In this post we will migrate and transform all the data from an Azure Cosmos DB collection with Azure Data Factory.
Azure Data Factory Linked services
Before you create a dataset or a pipeline, you must create a linked service to link your data store to the data factory. The Microsoft documentation describes linked services as being much like connection strings: they define the connection information that Data Factory needs to connect to external resources.
In order to create a Linked service, go to the “Manage” tab on the Azure Data Factory page and click “New” as shown below:
There you will select the Azure Cosmos DB (SQL API) connector, then select your Cosmos DB account name and database.
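Under the hood, the linked service you create in the UI is stored as a JSON definition. As a rough sketch, it looks something like the following; the service name, account endpoint, key, and database name are placeholders, not real values:

```json
{
  "name": "CosmosDbLinkedService",
  "properties": {
    "type": "CosmosDb",
    "typeProperties": {
      "connectionString": "AccountEndpoint=https://<account>.documents.azure.com:443/;AccountKey=<key>;Database=<database>"
    }
  }
}
```

You normally never have to write this by hand, but seeing it is useful when you want to version-control your factory or deploy it with ARM templates.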
Create Datasets
Datasets represent the source and the destination of our data within the linked service. In our case we will create one dataset for the source collection and another dataset for the destination collection.
In order to create a dataset, go to the “Author” tab, then Datasets and finally “New dataset”:
There you will select the Azure Cosmos DB SQL API and then select the Linked service that was created previously. You will do this twice, one dataset for each collection.
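Like linked services, each dataset is also stored as a JSON definition behind the UI. A rough sketch of one of the two datasets might look like this; the dataset name, linked service name, and collection name below are placeholders:

```json
{
  "name": "SourceCollectionDataset",
  "properties": {
    "type": "CosmosDbSqlApiCollection",
    "linkedServiceName": {
      "referenceName": "CosmosDbLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "collectionName": "source-collection"
    }
  }
}
```

The destination dataset has the same shape and only differs in its name and `collectionName`.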
Creating a Mapping Data Flow
Mapping Data Flow allows you to copy and transform data visually without having to write a single line of code. You can create a Data Flow where you get data from multiple sources, join the data, transform it, and finally store it in a new database.
In our scenario we will get the data from a collection, add a new property for each document and then save the data on another collection.
In order to create a Mapping Data Flow, go to “Data flows” then select “New mapping dataflow”.
You will then select “Data flow”:
Once it is created, select “Add source” in order to add your previously created dataset that will be the source of the data.
You will now transform each document before persisting it in the other collection. In order to do that, click on the small plus sign on the bottom right corner of the data source and then select “Derived Column”:
Now you can create a new property for each document. In this example the property is called “New Property” as seen below:
In order to add a value for this property, select “Open Expression builder”. You can create complex logic for the value of the property. In our case, we will set it to be the same as another document property. To do so, simply type the name of any of your document’s properties.
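The expression builder is not limited to plain column references; the Data Flow expression language also lets you combine columns and call built-in functions. For example, assuming the documents had `name` and `city` properties (hypothetical names for illustration), the new property could be set to:

```
concat(name, ' - ', city)
```

A plain column reference such as `name` is the simplest case, which is what we use in this post.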
Now you can add the destination dataset for the data to be persisted. In order to do that, click on the small plus sign on the bottom right corner of the Derived Column step and add a Sink:
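Conceptually, the whole flow is simple: read documents from the source, derive a new property on each one, and write the result to the sink. The following is a minimal, local Python sketch of that logic; the document shape and the "NewProperty" name are illustrative, not taken from a real collection:

```python
def derive_new_property(document: dict) -> dict:
    """Mimics the Derived Column step: copy an existing property's
    value into a new property on the document."""
    transformed = dict(document)  # leave the source document untouched
    transformed["NewProperty"] = transformed["name"]
    return transformed


def run_flow(source: list) -> list:
    """Mimics source -> Derived Column -> Sink on an in-memory 'collection'."""
    sink = []
    for doc in source:
        sink.append(derive_new_property(doc))
    return sink


if __name__ == "__main__":
    source_collection = [
        {"id": "1", "name": "Alice"},
        {"id": "2", "name": "Bob"},
    ]
    destination_collection = run_flow(source_collection)
    print(destination_collection[0]["NewProperty"])  # prints "Alice"
```

The Data Flow does the same thing at scale on Spark, but this is the transformation it applies to each document.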
Data Flow debug mode
The debug mode allows us to see the result of each step of the Data Flow without having to run the pipeline. This is very useful when creating complex data pipelines.
To enable the debug mode, simply select the option below:
For the Integration Runtime, leave it as AutoResolveIntegrationRuntime.
After doing that, select “Publish all” to persist all the changes made in the Azure Data Factory.
Creating a Pipeline
After creating your Data Flow, you must put it into a pipeline, just like any other activity in Data Factory. To create a pipeline, select “New Pipeline”:
Now you select “Move and Transform” and then drag and drop the Data Flow icon:
Now everything is ready to be run. There are two ways to run the pipeline: you can debug it or you can execute it. The difference between the two is explained in a great article by Cathrine Wilhelmsen:
As we want to see and keep the history of the executed pipelines, we will be executing the pipeline. To do so, we select “Add trigger” and then “Trigger now”.
The time it takes to run the transformation on a small dataset is not due to the size of the dataset, but rather due to the overhead of running the pipeline.
To see the status of the pipeline, we select “Monitor” on the left of the page, and the running pipeline will be there with its information.
Conclusion
In this article we have created an Azure Data Factory pipeline in order to move and transform data from an Azure Cosmos DB collection. If you would like to read more about the subject, the reference section below lists some great articles on it.