Understanding the DevOps for data capability¶
Note
The Data Playbook defines a set of capabilities that represent conceptual building blocks that are used to build data-related solutions. See Defining data capabilities to see the full set of capabilities defined in the playbook.
DevOps can be defined as the union of people, process, and products to enable continuous delivery of value to the business. It's an iterative process of "Developing", "Building & Testing", "Deploying", "Operating", "Monitoring and Learning" and "Planning and Tracking".
The application of DevOps principles to data can be understood through the concepts of data pipelines and CI/CD pipelines:
Understanding data pipelines vs CI/CD pipelines¶
- Data pipeline: Also termed "value" pipelines, data pipelines convert raw data into meaningful information, delivering value to the business. Data engineers generally own the data ingestion, transformation, and sharing processes that make up data pipelines, and they're responsible for encoding business requirements into those pipelines.
- CI/CD pipeline: CI/CD pipelines are one of the most critical parts of DevOps in general. In the context of data systems, CI/CD pipelines continuously update the data pipelines in different environments as new ideas are developed, tested, and deployed to Production. The CI/CD pipeline is often termed the innovation pipeline because it enables the change process. The platform automation and operations team typically owns the maintenance of CI/CD pipelines.
This means there are two different orchestrators to maintain: one for the data pipeline and one for the CI/CD pipeline. The testing strategy for the two pipelines differs as well: when testing the data pipeline, the pipeline code is fixed while the data changes; when testing changes flowing through the CI/CD pipeline, the data is fixed while the pipeline code changes.
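The distinction can be illustrated with a small pytest-style sketch. The `clean_readings` transformation below is hypothetical, not taken from any of the samples: one test exercises changing code against a fixed, known dataset (a test for code), while the other asserts invariants that any incoming batch must satisfy (a test for data).

```python
# Sketch: the same transformation tested from two angles.
# `clean_readings` is a hypothetical transformation used for illustration.

def clean_readings(rows):
    """Drop rows with missing values and negative readings."""
    return [r for r in rows if r.get("value") is not None and r["value"] >= 0]

# Test for code (innovation/CI-CD pipeline): fixed input, assert exact output.
def test_clean_readings_fixed_data():
    fixed = [{"value": 5}, {"value": -1}, {"value": None}]
    assert clean_readings(fixed) == [{"value": 5}]

# Test for data (value pipeline): the code is fixed, the incoming batch varies;
# assert invariants that any valid batch must satisfy after cleaning.
def test_batch_quality(rows=None):
    rows = rows if rows is not None else [{"value": 3}, {"value": 7}]
    cleaned = clean_readings(rows)
    assert all(r["value"] >= 0 for r in cleaned)
```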
How to use Infrastructure as Code (IaC) for data services¶
Automated provisioning:
- Automate the provisioning of data infrastructure and platforms using IaC.
- Enable the dynamic scaling of resources based on data processing demands.
Environment consistency:
- Ensure consistency across development, testing, and production environments using IaC.
- Automate the creation and configuration of databases, data warehouses, and analytics platforms.
How to use CI/CD for data¶
Continuous integration (CI) is the DevOps practice of merging code changes from different contributors into a centralized repository, where automated builds and tests are then run. CI helps identify bugs or issues early in the development lifecycle, when they're easier and faster to fix.
Continuous delivery (CD) builds upon continuous integration (CI). CD is the process of taking the build artifacts and deploying them to different environments, such as QA and Staging. CD helps in testing the new changes for stability, performance, and security.
Automated data pipelines:
- Implement automated data pipelines for seamless extract, transform, and load (ETL) processes.
- Use CI/CD practices to version-control and deploy changes to data pipelines.
Version control for data artifacts:
- Apply version control to all data-related artifacts, including models, transformations, and analytics code.
- Ensure traceability and repeatability of changes to data artifacts.
Automated testing:
- Implement automated testing for data processes to ensure data quality and reliability.
- Incorporate testing into the CI/CD pipeline for efficient and reliable releases.
- Include tests for both the value pipelines (tests for data) and the innovation/delivery pipelines (tests for code).
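The data-quality side of automated testing can be sketched as a set of reusable checks that a CI/CD pipeline runs after each load. The column names (`order_id`, `amount`) and thresholds below are illustrative assumptions, not from the playbook samples:

```python
# Minimal data-quality checks of the kind a pipeline could run after each load.
# Column names and ranges are illustrative.

def check_not_null(rows, column):
    """Every row must have a non-null value in the column."""
    return all(r.get(column) is not None for r in rows)

def check_unique(rows, column):
    """No duplicate values in the column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_range(rows, column, low, high):
    """All values fall inside an expected business range."""
    return all(low <= r[column] <= high for r in rows)

batch = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 75.5},
]

assert check_not_null(batch, "order_id")
assert check_unique(batch, "order_id")
assert check_range(batch, "amount", 0, 10_000)
```

In practice, checks like these would be wired into the pipeline so that a failing batch blocks the release or quarantines the data.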
Implementing workflow orchestration¶
Workflow automation:
- Orchestrate end-to-end data workflows, integrating various tools and processes seamlessly. Check Data: Data Orchestration for details.
- Implement workflow automation for scheduling and coordinating data processing tasks.
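The core idea behind workflow orchestration, running each task only after its upstream dependencies complete, can be sketched with Python's standard-library `graphlib`. The task names form a hypothetical pipeline, and the task bodies are placeholders:

```python
# Sketch of dependency-driven task scheduling, the core idea behind data
# orchestrators: each task runs only after its upstream tasks complete.
from graphlib import TopologicalSorter

# Hypothetical pipeline: ingest -> transform -> {load, validate}
dag = {
    "transform": {"ingest"},
    "load": {"transform"},
    "validate": {"transform"},
}

def run_pipeline(dag):
    """Execute tasks in an order that respects every dependency edge."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        print(f"running {task}")  # placeholder for the real task body
    return order

order = run_pipeline(dag)
```

Real orchestrators add retries, scheduling, and parallel execution of independent tasks on top of this same dependency model.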
Using monitoring and feedback¶
Real-time monitoring:
- Implement tools for continuous monitoring of data pipelines, databases, and analytics platforms.
- Monitor data quality, data integrity, and data security in real time.
Alerting and feedback mechanisms:
- Set up automated alerts to notify teams of anomalies, errors, or performance degradation.
- Create feedback loops to aid communication between development, operations, and other relevant teams.
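The alerting pattern can be sketched as a threshold rule evaluated over a pipeline metric. The metric name and threshold are illustrative, and the notification step is stubbed with a print in place of a real webhook or incident call:

```python
# Sketch of a threshold alert over a data pipeline metric.
# Metric name and threshold are illustrative; notification is stubbed.

def evaluate_alert(metric_name, value, threshold):
    """Return an alert payload if the metric breaches its threshold, else None."""
    if value > threshold:
        return {
            "severity": "warning",
            "metric": metric_name,
            "value": value,
            "threshold": threshold,
        }
    return None

alert = evaluate_alert("rows_rejected", 120, threshold=100)
if alert:
    # In a real setup this would post to a webhook, pager, or ticketing system.
    print(f"ALERT: {alert['metric']}={alert['value']} (limit {alert['threshold']})")
```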
How to use configuration management¶
Parameterization:
- Use parameterized configurations to make IaC adaptable to different scenarios, environments, and data service requirements.
- For data pipelines, consider metadata-driven approaches to make the pipelines more flexible and adaptable.
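A metadata-driven pipeline can be sketched as generic code that processes each source according to a configuration record rather than hard-coded logic. The source names, paths, and the `build_copy_activity` helper below are illustrative assumptions:

```python
# Sketch of a metadata-driven pipeline: one generic routine processes every
# source based on a config record. Names and paths are illustrative.

pipeline_metadata = [
    {"source": "sales", "path": "/raw/sales", "load_mode": "incremental"},
    {"source": "customers", "path": "/raw/customers", "load_mode": "full"},
]

def build_copy_activity(entry, environment):
    """Render one copy step from a metadata entry for a target environment."""
    return {
        "name": f"copy_{entry['source']}",
        "input": f"abfss://{environment}@datalake{entry['path']}",
        "mode": entry["load_mode"],
    }

# The same metadata drives every environment; only the parameter changes.
activities = [build_copy_activity(e, "dev") for e in pipeline_metadata]
```

Adding a new source then becomes a metadata change rather than a code change, which keeps the CI/CD surface small.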
Handling sensitive data:
- Keep secrets in separate configuration files that aren't checked in to the repo.
- Add such files to `.gitignore` to prevent them from being checked in.
- Where possible, use Azure Key Vault to store and manage secrets.
Implementing security and compliance¶
Automated security measures:
- Implement automated security measures, including access controls, encryption, and identity management.
- Integrate security checks into the CI/CD pipeline for proactive security measures.
Compliance checks:
- Automate compliance checks to ensure that data processing adheres to regulatory requirements.
- Implement automated audits and reporting for compliance purposes.
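An automated compliance check can be sketched as a rule set applied to exported resource configurations in CI. The field names (`encryption_enabled`, `public_network_access`) and the two rules are illustrative assumptions, not a real policy engine:

```python
# Sketch of an automated compliance check over resource configurations.
# Field names and rules are illustrative.

def check_compliance(resource):
    """Return the list of rule violations for one resource config."""
    violations = []
    if not resource.get("encryption_enabled", False):
        violations.append("encryption must be enabled")
    if resource.get("public_network_access") == "Enabled":
        violations.append("public network access must be disabled")
    return violations

resources = [
    {"name": "sqldb1", "encryption_enabled": True, "public_network_access": "Disabled"},
    {"name": "stg1", "encryption_enabled": False, "public_network_access": "Enabled"},
]

# A CI gate could fail the build if any resource reports violations.
report = {r["name"]: check_compliance(r) for r in resources}
```

In practice, services such as Azure Policy cover this ground natively; the sketch only shows the shape of a custom check.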
Learn more about DevOps for data in Microsoft Fabric¶
Implementations¶
Parking sensors sample¶
The MDW Repo: Parking Sensors sample covers the end-to-end implementation of various characteristics of 'DevOps for Data'. It demonstrates how DevOps principles can be applied to end-to-end data pipeline solutions built according to the Modern Data Warehouse pattern. The sample covers the following topics:
- Bicep-based IaC deployment of Azure data services - Link.
- GitHub integration with Azure Data Factory (ADF) - Link.
- Build and Release (CI/CD) pipelines - Link.
- Environment variables and parameterization using Azure DevOps variable groups - Link.
- Automated testing of data pipelines including unit and integration tests - Link.
- Manual approval gates for release pipelines - Link.
ADF CI/CD auto publish¶
The MDW Repo: ADF CI/CD auto-publish sample demonstrates the deployment of Azure Data Factory (ADF) using the Automated Publish method. Usually, ADF deployment requires a Manual Publish setup, where the developer publishes the ADF changes manually from the portal; this step generates the ARM templates used in the deployment steps. The sample eliminates the manual publish by using the publicly available npm package @microsoft/azure-data-factory-utilities for automated publishing. Also check the official documentation on CI/CD in ADF and Automated publishing for CI/CD.
IaC deployment samples for secured networks¶
The following table contains IaC samples for deploying various Azure data services within a secure network configuration. These code samples are authored in Bicep, but they can be customized for use with Terraform.
Azure Data Service | IaC Code Sample |
---|---|
Azure Synapse Analytics | Azure Synapse VNet recipe |
Azure Databricks | Azure Databricks VNet recipe |
Microsoft Purview | Microsoft Purview VNet recipe |
Azure Data Factory | Azure Data Factory VNet recipe |
Examples¶
The following sections cover options for preparing sandbox environments.
Using sandbox environment options¶
For data solutions, sandbox environments generally need extra preparation steps depending on the Azure Services.
Data Service | Sandbox Environment Options |
---|---|
Data Lake Gen2 Storage | A common sandbox file system can be created, and each developer can then create their own folder within this filesystem. |
Azure SQL or SQL Data Warehouse | A transient database (restored from DEV) can be spun up per developer on demand. |
Azure Synapse Analytics | Git integration allows developers to make changes to their own branches and debug runs independently. |
How to use version control¶
Azure data services vary in their approaches to version control. The following table outlines available options for several commonly used Azure data services.
Azure Data Service | Documentation Link |
---|---|
Azure Data Factory | DevOps in ADF |
Azure Synapse Analytics | DevOps in Synapse Analytics |
Azure Databricks | DevOps in Azure Databricks |
Azure SQL or SQL Data Warehouse | DevOps in Azure SQL |
Learn how to publish data artifacts¶
Azure Artifacts can be used alongside Azure Pipelines for deploying packages, publishing build artifacts, or integrating files across pipeline stages. The following table contains links to various options for publishing data artifacts.
Artifact name | Applicable To | Example |
---|---|---|
SQL DACPAC | Azure SQL Database | Publishing SQL DACPAC |
Python Wheel | Python | Creating and publishing wheel distribution package |
Databricks Notebook | Azure Databricks | Publishing Databricks notebook |
Perform unit testing¶
- Apache Spark: The MDW Repo: Azure Databricks and MDW Repo: Azure Synapse samples showcase how to execute unit tests for data transformation code written in Apache Spark. They encapsulate the business logic into a Python wheel package, keeping data access code in the notebooks that load the package.
- Azure Stream Analytics (ASA): The MDW Repo: Azure Stream Analytics sample showcases how to execute unit tests for ASA.
- Data Factory testing Framework: This stand-alone Data Factory - testing framework allows writing unit tests for Data Factory pipelines on Microsoft Fabric and Azure Data Factory.
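The wheel-packaging pattern above can be sketched in a few lines. The `add_tax` function is illustrative, not from the samples: the business logic lives in an importable module (packaged as a wheel), and unit tests exercise it in CI without a cluster or live data:

```python
# transformations.py -- business logic packaged as a wheel, importable anywhere.
def add_tax(amount, rate=0.2):
    """Pure business logic: no Spark session, no data access."""
    return round(amount * (1 + rate), 2)

# test_transformations.py -- unit tests run in CI, no cluster required.
def test_add_tax_default_rate():
    assert add_tax(100.0) == 120.0

def test_add_tax_custom_rate():
    assert add_tax(50.0, rate=0.1) == 55.0
```

The notebook then only loads data and calls `add_tax`, so the untestable surface stays as thin as possible.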
Perform integration testing¶
- Azure Data Factory (ADF)/Azure Synapse Data Pipelines: The MDW Repo: ADF (Single), MDW Repo: ADF (E2E Sample), and MDW Repo: Azure Synapse samples showcase using the pytest framework to trigger a set of integration tests for ADF as part of a CD pipeline.
- Azure Stream Analytics: The MDW Repo: Azure Stream Analytics sample showcases how to do integration tests with ASA using Node.js (TypeScript).
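The trigger-and-poll pattern these integration tests follow can be sketched as below. `PipelineClient` here is a stand-in stub, not the real ADF SDK; a real test would call the service's REST API or client library:

```python
# Sketch of the integration-test pattern: trigger a pipeline run, poll until
# it finishes, then assert on the outcome. PipelineClient is a stub, not the
# real ADF SDK.
import time

class PipelineClient:
    """Stub standing in for a real service client."""
    def trigger(self, name):
        self._polls = 0
        return "run-001"

    def get_status(self, run_id):
        self._polls += 1
        return "Succeeded" if self._polls >= 2 else "InProgress"

def run_and_wait(client, pipeline, timeout_s=60, poll_s=0):
    """Trigger the pipeline and poll until it reaches a terminal state."""
    run_id = client.trigger(pipeline)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = client.get_status(run_id)
        if status in ("Succeeded", "Failed"):
            return status
        time.sleep(poll_s)
    return "TimedOut"

def test_pipeline_succeeds():
    assert run_and_wait(PipelineClient(), "ingest_parking_data") == "Succeeded"
```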
For more information¶
- DataKitchen: DataOps Cookbook
- Microsoft: What is DevOps?
- Microsoft: DevOps Checklist
- Engineering Fundamentals: Continuous Integration
- Engineering Fundamentals: Continuous Delivery
- Engineering Fundamentals: Unit Testing
- Engineering Fundamentals: Integration Testing
- DevSecOps: Overview
- Engineering Fundamentals: Penetration Testing
- Engineering Fundamentals: Credential Scanning
- Engineering Fundamentals: Secrets Management