Azure Data Factory 4 Everyone
“In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn’t have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision-makers.
Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that’s built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.”
What is Azure Data Factory? from Microsoft DOCS
This blog post is part of an Azure Data Factory series – this first post is an introduction to the service and describes its main components.
Azure Data Factory – Concept & Components
Azure Data Factory is a managed data integration service that allows you to orchestrate and automate data movement and data transformation to and from Azure. Azure Data Factory works with many platforms and mixed environments, enabling data-driven workflows that integrate data sources both in the cloud and on-premises.
Azure Data Factory allows you to connect to and integrate many platforms – for example, SQL Server, Azure SQL Database, Oracle, and MySQL – as well as file and big data stores such as the local file system, Azure Blob storage, Azure Data Lake Storage, and many more. Moreover, Azure Data Factory can execute SQL Server Integration Services (SSIS) packages, which is useful in situations where you require more sophisticated data movement and transformation tasks.
The main advantage of using Azure Data Factory is its capability to process and transform data by using compute services such as Azure Data Lake Analytics, Azure Machine Learning, and Azure HDInsight Hadoop. The output can then be published to data stores for BI tools to perform visualization or analytics.
Connect and collect
Enterprises have data of many types located in different sources – on-premises, in the cloud, structured, and unstructured – all arriving at different intervals and speeds.
The first step in developing an information production system is to connect to all the required sources of data and processing, such as SaaS, databases, file shares, and FTP web services. The next step is to move the data as needed to a centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to integrate these data sources and processing. It’s expensive and hard to integrate and maintain such systems. Moreover, they often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer.
Azure Data Factory allows you to use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can collect data in Azure Data Lake Storage and transform it later by using an Azure Data Lake Analytics compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
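As a minimal sketch in ADF’s JSON authoring format, a pipeline with a single Copy activity could look roughly like this (all names here – the pipeline, datasets, and source/sink types – are hypothetical placeholders, not from the text above):

```json
{
  "name": "CopyBlobToDataLakePipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlob",
        "type": "Copy",
        "inputs": [
          { "referenceName": "InputBlobDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "OutputDataLakeDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```

The concrete source and sink types depend on the formats of the referenced datasets.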
After data is present in a centralized data store in the cloud, process or transform the collected data by using Azure Data Factory mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs that execute on Spark without needing to understand Spark clusters or Spark programming.
If you prefer to code transformations by hand, Azure Data Factory supports external activities for executing your transformations on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.
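As a hedged illustration of such an external activity (the linked service names and script path below are hypothetical), an HDInsight Hive activity that runs a hand-written script can be declared inside a pipeline roughly like this:

```json
{
  "name": "TransformWithHive",
  "type": "HDInsightHive",
  "linkedServiceName": {
    "referenceName": "HDInsightLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "scriptPath": "scripts/transform.hql",
    "scriptLinkedService": {
      "referenceName": "BlobStorageLinkedService",
      "type": "LinkedServiceReference"
    }
  }
}
```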
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to incrementally develop and deliver your ETL processes before publishing the finished product. After the raw data has been refined into a business-ready consumable form, load the data into Azure SQL Data Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools.
Azure Data Factory can be used like any traditional ETL tool, but its primary objective is to move data into Azure data services for further processing or visualization. It is a cloud-based data integration service that allows organizations to build, debug, deploy, monitor, and operationalize big data pipelines.
Azure Data Factory connects to many external platforms, but it also includes several internal components that transform source data and make it consumable for the end product:
- Pipelines and Activities
- Data Flows
- Datasets
- Linked Services
- Integration Runtimes
- Triggers, and more

Let’s look at a few of the main components inside Azure Data Factory.
Pipelines are logical groups of activities that together perform a unit of work; a data factory can hold multiple pipelines, and each pipeline can contain multiple activities. The activities inside a pipeline can be structured to run sequentially or in parallel, depending on the system requirements. Pipelines allow users to easily manage and schedule multiple activities together.
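Sequencing is expressed through activity dependencies. In the hypothetical sketch below (using minimal Wait activities just to show the structure), `StepA` and `StepB` both depend on `Start`, so they run in parallel after it succeeds:

```json
{
  "name": "OrderedPipeline",
  "properties": {
    "activities": [
      {
        "name": "Start",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 1 }
      },
      {
        "name": "StepA",
        "type": "Wait",
        "dependsOn": [
          { "activity": "Start", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": { "waitTimeInSeconds": 1 }
      },
      {
        "name": "StepB",
        "type": "Wait",
        "dependsOn": [
          { "activity": "Start", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": { "waitTimeInSeconds": 1 }
      }
    ]
  }
}
```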
Data flows are a special type of activity that allows data engineers to develop data transformation logic visually, without writing code. A data flow is executed inside an Azure Data Factory pipeline on an Azure Databricks cluster for scaled-out processing using Spark. The benefit here is that Azure Data Factory handles all of the data flow execution and code translation.
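Inside a pipeline, a mapping data flow is invoked through an Execute Data Flow activity that references the data flow by name, along the lines of this sketch (the data flow name is hypothetical):

```json
{
  "name": "RunMappingDataFlow",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataFlow": {
      "referenceName": "CleanseCustomerData",
      "type": "DataFlowReference"
    }
  }
}
```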
Datasets represent the data that activities move or transform, and in a dataset’s settings you specify the data’s configuration – for example, a database table, or a file name and folder structure. Moreover, every dataset refers to a linked service.
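As a sketch, a dataset for a delimited file in Blob storage could be defined like this (the linked service, container, and file names are hypothetical):

```json
{
  "name": "InputBlobDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "BlobStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "fileName": "sales.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```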
Linked services are a prerequisite: before connecting to a data source, you must define the connection. A linked service is much like a connection string – for example, a linked service to SQL Server contains the connection information for that server. Linked services hold the information about data sources and services, and activities and datasets connect to those sources through them.
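A minimal linked service sketch for Azure SQL Database follows (server, database, and credential values are placeholders; in a real deployment the secret would typically be referenced from Azure Key Vault rather than stored inline):

```json
{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<db>;User ID=<user>;Password=<password>;"
    }
  }
}
```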
Integration runtimes provide the compute infrastructure on which activities are executed, and they come in three types:
- Azure IR provides fully managed, serverless compute in Azure; all data movement and transformation between cloud data stores is done in the cloud
- Self-hosted IR manages activities between cloud data stores and data stores residing in a private network
- Azure-SSIS IR is primarily required to execute native SSIS packages
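The JSON definition of a self-hosted integration runtime is very small – the actual work happens in the runtime software installed on a machine inside the private network. A sketch (the name and description are hypothetical):

```json
{
  "name": "OnPremIR",
  "properties": {
    "type": "SelfHosted",
    "description": "Runtime installed on a VM inside the private network"
  }
}
```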
Azure Data Factory Scenario – Reduce overhead costs
When migrating a SQL Server database to the cloud, you can preserve your ETL processes and reduce operational complexity with a fully managed experience in Azure Data Factory. Rehost on-premises SSIS packages in the cloud with minimal effort using the Azure-SSIS integration runtime, while continuing to work with familiar SSIS tools.
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There are different types of triggers for different types of events.
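For example, a schedule trigger that kicks off a pipeline once a day could be declared roughly as follows (the trigger and pipeline names, start time, and time zone are hypothetical):

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-01-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "CopyBlobToDataLakePipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

Other trigger types cover other events – for example, tumbling window triggers and event-based triggers that fire when a file arrives in storage.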
A pipeline run is an instance of a pipeline execution, typically instantiated by passing arguments to the parameters that are defined in the pipeline. The arguments can be passed manually or within the trigger definition.
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
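A sketch of how this fits together (all names are hypothetical): the pipeline declares a parameter with a default value, and activities consume it through the `@pipeline().parameters.<name>` expression; a trigger or a manual run supplies the actual argument.

```json
{
  "name": "ParameterizedPipeline",
  "properties": {
    "parameters": {
      "inputFileName": { "type": "String", "defaultValue": "sales.csv" }
    },
    "activities": [
      {
        "name": "CopyNamedFile",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "InputBlobDataset",
            "type": "DatasetReference",
            "parameters": {
              "fileName": "@pipeline().parameters.inputFileName"
            }
          }
        ],
        "outputs": [
          { "referenceName": "OutputDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```

This assumes the referenced dataset itself declares a `fileName` parameter that the pipeline passes through.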
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
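A For-each looping container, for instance, iterates over a collection and runs its inner activities for each item, optionally in parallel. A hedged sketch (the parameter name is hypothetical, and a trivial Wait activity stands in for real per-item work):

```json
{
  "name": "ProcessEachFile",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.fileList",
      "type": "Expression"
    },
    "isSequential": false,
    "activities": [
      {
        "name": "PerItemWork",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 1 }
      }
    ]
  }
}
```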
Variables can be used inside of pipelines to store temporary values and can also be used in conjunction with parameters to enable passing values between pipelines, data flows, and other activities.
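As a final sketch, a pipeline variable is declared at the pipeline level and written with a Set Variable activity. The example below assumes a preceding Copy activity named `CopyFromBlob` whose output is captured (the names and expression are hypothetical):

```json
{
  "name": "VariablePipeline",
  "properties": {
    "variables": {
      "copiedRows": { "type": "String" }
    },
    "activities": [
      {
        "name": "RecordRowCount",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "copiedRows",
          "value": "@string(activity('CopyFromBlob').output.rowsCopied)"
        }
      }
    ]
  }
}
```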