
Understanding the data processing capability

Note

The Data Playbook defines a set of capabilities that represent the conceptual building blocks used to build data-related solutions. Refer to Defining data capabilities for the full set of capabilities defined in the playbook.

The goal of data processing is to process or transform the raw data brought in by the data ingestion capability so that it supports business operations. Other terms commonly used for data processing are data transformation, data wrangling, and data munging.

Generally, data processing systems optimize either for higher overall throughput or for lower end-to-end latency. This trade-off is what typically distinguishes batch processing frameworks from stream/real-time processing frameworks. The distinction is not black and white, however; it is a spectrum, and the various data processing frameworks sit at different points along it. The target of data processing is almost always the set of data repositories built per the data architecture in place (for example, a data warehouse (DWH), data lake, lakehouse, data mesh, or a combination of these).
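To make the throughput/latency trade-off concrete, the sketch below expresses a similar aggregation once as a bounded batch job and once as an unbounded streaming query, using PySpark Structured Streaming. The paths, source, and column names are illustrative assumptions and not part of the playbook.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch (throughput-oriented): read a bounded dataset, aggregate once, write the result.
# The input path is illustrative; any bounded source works the same way.
orders = spark.read.parquet("/data/raw/orders")
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))
totals.write.mode("overwrite").parquet("/data/curated/order_totals")

# Streaming (latency-oriented): read an unbounded source and emit results continuously.
# The built-in "rate" source stands in for a real event stream here.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)  # run briefly for the example, then stop
query.stop()
```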

Refer to ETL/ELT to learn more about the foundational concepts of any data processing implementation.

Refer to the Understanding data processing characteristics section for the common steps associated with data processing.

Common data processing types/architectures are batch processing and stream (real-time) processing.

Understanding data processing characteristics

Common steps in this stage include:

  • Converting data into required/standard formats.
  • Identifying invalid records (record level validation).
  • Identifying and optionally removing duplicate records.
  • Preparing data for storage in the target architecture (the type of storage, such as a database or a lake; the level of normalization for a database; and so on).
  • Column/field level data validations.
  • Gathering of operational data.
  • Applying data transformations, per business requirements, to derive new data elements and/or transform existing fields.
  • Applying data quality checks.
  • Performing data cleansing.
  • Populating the Raw/Bronze and Silver/Transformed storage layers, which keep data at finer granularity, and the Gold/Curated layer, which holds aggregated information (a minimal sketch of several of these steps follows this list).
  • Ensuring data availability to other data processes/systems as per agreements (API requests, data formats, SLAs).
  • Monitoring and logging at each step.

Note that the list is not exhaustive and not all systems include all these steps.
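The following is a minimal PySpark sketch of several of the steps above (format conversion, record-level validation, deduplication, and populating Bronze/Silver/Gold layers). The file paths, column names, and validation rules are illustrative assumptions rather than a prescribed design.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("processing-steps").getOrCreate()

# Raw/Bronze: land the ingested data largely as-is (here, CSV converted to a columnar format).
raw = spark.read.option("header", True).csv("/landing/customers.csv")
raw.write.mode("overwrite").parquet("/lake/bronze/customers")

# Silver/Transformed: record-level validation, format standardization, and deduplication.
silver = (
    raw.filter(F.col("customer_id").isNotNull())              # drop invalid records
       .withColumn("signup_date", F.to_date("signup_date"))   # convert to a standard format
       .dropDuplicates(["customer_id"])                       # optional duplicate removal
)
silver.write.mode("overwrite").parquet("/lake/silver/customers")

# Gold/Curated: aggregated view derived per business requirements.
gold = silver.groupBy("country").agg(F.count("*").alias("customer_count"))
gold.write.mode("overwrite").parquet("/lake/gold/customers_by_country")
```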

Learn more about data processing in Microsoft Fabric

Microsoft Fabric supports both batch and stream processing. Data processing on Fabric also supports both No-Code/Low-Code solutions and data-engineering-oriented solutions. Microsoft Fabric supports batch processing (the cold path) through Data pipelines, Dataflows, Notebooks executed as Spark jobs, and SQL queries through the Data Warehouse feature. Real-Time Analytics makes stream processing (the hot path) possible in Microsoft Fabric.

Refer to the decision guide for a comparison of these options.
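As a small illustration of the cold path, the sketch below shows the kind of Spark code a Fabric notebook might run. It assumes a lakehouse is attached to the notebook as the default (so the `spark` session is pre-initialized and relative `Files/` paths resolve against it); the file names and columns are placeholders.

```python
from pyspark.sql import functions as F

# In a Fabric notebook the Spark session is already available as `spark`.
# "Files/raw/sales.csv" is an assumed location under the attached lakehouse.
raw = spark.read.option("header", True).csv("Files/raw/sales.csv")

daily = (
    raw.withColumn("order_date", F.to_date("order_date"))
       .groupBy("order_date")
       .agg(F.sum("amount").alias("daily_total"))
)

# Saving as a Delta table makes the result available to downstream consumers,
# such as the lakehouse SQL analytics endpoint.
daily.write.format("delta").mode("overwrite").saveAsTable("sales_daily_totals")
```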

Examples

Learn how to use batch processing

Learn how to use stream processing

  • Project StreamShark: StreamShark (SS) is a console application that connects to an event hub and retrieves all of its messages. The tool provides real-time observation of event flow patterns, with the intent of shortening the application development feedback loop. When SS runs, it connects to Azure Event Hubs, retrieves data from the last checkpoint, aggregates metrics about the received data in real time, and displays the metrics in a terminal window (a minimal consumer sketch follows this list).
  • Azure Kusto Labs: This repository features self-contained, hands-on labs with detailed, step-by-step instructions. The artifacts (data, code, and so on) needed to try out various features and integration points of Azure Data Explorer (Kusto) are also included.
  • Streaming at Scale: These samples show how to set up an end-to-end streaming-at-scale solution using a choice of different Azure technologies. Possible ways to implement streaming in Azure include the Kappa and Lambda architectures, variations of them, or custom designs. Each architectural option can also be implemented with different technologies, each with its own pros and cons.
  • Build a delta lake to support ad hoc queries in online leisure and travel booking: This architecture provides an example delta lake for travel booking, where large volumes of raw documents are generated at high frequency.
  • Serverless event processing: Reference architecture showing a serverless event-driven architecture. The sample ingests a stream of data, processes the data, and writes the results to a back-end database.
  • De-batch and filter serverless event processing with Event Hubs: A solution idea showing a variation of a serverless event-driven architecture using Azure Event Hubs and Azure Functions to ingest and process a stream of data. The results are written to a database for storage and future review once de-batched and filtered.
  • Azure Kubernetes in event stream processing: Describes a variation of a serverless event-driven architecture, hosted on Azure Kubernetes Service (AKS) with a KEDA scaler. The sample ingests a stream of data, processes the data, and then writes the results to a back-end database.
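As a minimal sketch of the consume-from-last-checkpoint pattern that StreamShark describes, the following uses the azure-eventhub SDK with a blob checkpoint store. The connection strings, names, and the metrics kept are placeholders, and the aggregation is deliberately simplistic.

```python
from collections import Counter
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# Placeholders: substitute real connection strings and names.
checkpoint_store = BlobCheckpointStore.from_connection_string(
    "<storage-connection-string>", "<checkpoint-container>"
)
client = EventHubConsumerClient.from_connection_string(
    "<event-hub-connection-string>",
    consumer_group="$Default",
    eventhub_name="<event-hub-name>",
    checkpoint_store=checkpoint_store,
)

metrics = Counter()

def on_event(partition_context, event):
    # Aggregate simple metrics about received events and display them.
    metrics["events_total"] += 1
    metrics[f"partition_{partition_context.partition_id}"] += 1
    print(dict(metrics))
    # Persist progress so a restart resumes from the last checkpoint.
    partition_context.update_checkpoint(event)

with client:
    # "-1" starts from the beginning of the stream when no checkpoint exists yet.
    client.receive(on_event=on_event, starting_position="-1")
```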

For more information