
Understanding the data ingestion capability

Note

The Data Playbook defines a set of capabilities: conceptual building blocks used to assemble data-related solutions. See Defining data capabilities for the full set of capabilities defined in the playbook.

The first stage of the data engineering lifecycle is data ingestion from the source systems where data is generated. Data is imported from one or more sources into a known storage medium, where downstream processes can access it.

Understanding data ingestion characteristics

Ingested data arrives in various formats: streams of events coming from devices, or files extracted from another data source. Depending on which system manages the process, the data is either sent (pushed) to its destination or retrieved (pulled) from the source.

The two major ingestion patterns are storage for batch processing and event streaming for real-time data ingestion.

When files need to be collected from multiple locations or produced by long-running processes, a common approach is batch processing. Data is created and stored in a well-known location, and then a trigger (such as a scheduled time, a storage action, or a size limit) starts the processing.

An example of this batch approach is a supplier posting a CSV catalog of available parts to an Azure Storage location. The file is then loaded into a system of record for processing purchase orders. Another example is a cron job that runs a database extraction at midnight every night.
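To make the batch pattern concrete, here is a minimal Python sketch that downloads a supplier's CSV from Azure Blob Storage with the azure-storage-blob SDK and iterates over its rows. The connection string, container name, blob name, and column names are hypothetical placeholders, and a real pipeline would be started by the trigger (schedule, storage event, or size threshold) rather than run by hand.

```python
import csv
import io

from azure.storage.blob import BlobServiceClient

# Hypothetical connection details for illustration only.
CONN_STR = "<storage-account-connection-string>"

service = BlobServiceClient.from_connection_string(CONN_STR)
blob = service.get_blob_client(container="supplier-drops", blob="parts-catalog.csv")

# Download the posted CSV and parse it row by row.
content = blob.download_blob().readall().decode("utf-8")
for row in csv.DictReader(io.StringIO(content)):
    # Placeholder for loading each record into the system of record;
    # "part_number" and "quantity" are assumed column names.
    print(row["part_number"], row["quantity"])
```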

For real-time data, stream ingestion handles continuous, unbounded datasets. Many current stream ingestion approaches also use mini-batches, which can reduce the number of I/O operations. As each message (or event) arrives at the broker or other event router, a record is created. The event is then pushed to an event subscriber, or pulled by an event consumer, for further processing.

An example of real-time ingestion is a refrigeration unit sending temperature messages to Azure Event Hubs. During processing, if a temperature falls outside a specific range, an alert is sent to an operations manager so that the product stored in the unit doesn't spoil.
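Here is a minimal sketch of that streaming consumer, using the azure-eventhub SDK's EventHubConsumerClient to receive temperature events and flag out-of-range readings. The connection string, event hub name, safe range, and alerting mechanism are assumptions for illustration; a production consumer would typically also configure a checkpoint store (for example, in Azure Blob Storage) so that progress survives restarts.

```python
from azure.eventhub import EventHubConsumerClient

# Hypothetical connection details and acceptable temperature range.
CONN_STR = "<event-hubs-connection-string>"
SAFE_RANGE = (-5.0, 4.0)  # degrees Celsius, assumed for this example

def on_event(partition_context, event):
    temperature = float(event.body_as_str())
    low, high = SAFE_RANGE
    if not (low <= temperature <= high):
        # Placeholder for the real alerting mechanism (email, Teams, pager).
        print(f"ALERT: unit reported {temperature} C, outside {SAFE_RANGE}")
    # Record progress so a restarted consumer resumes from this point.
    partition_context.update_checkpoint(event)

client = EventHubConsumerClient.from_connection_string(
    CONN_STR,
    consumer_group="$Default",
    eventhub_name="refrigeration-telemetry",
)
with client:
    # Blocks and invokes on_event for each message as it arrives.
    client.receive(on_event=on_event, starting_position="-1")
```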

Sometimes on-premises data is secured behind a firewall with virtual networks or blocked by IP restrictions. Additionally, some Microsoft services use an integration runtime (IR) for data integration: the compute infrastructure that pipelines use to provide these capabilities. For details, see Integration runtime.

Learn more about data ingestion in Microsoft Fabric

Microsoft Fabric offers both visual and code-based solutions for data ingestion, including data pipelines, Dataflow Gen2, eventstreams, and Spark notebooks.

The capabilities of these tools range from simple data processing during ingestion to complex processing and orchestration.
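As one code-based example, the following PySpark sketch ingests a CSV file and lands it as a Delta table for downstream consumption. It assumes it runs in a Fabric notebook with a default lakehouse attached (where the spark session is predefined); the file path and table name are hypothetical.

```python
# Read a CSV file from the attached lakehouse; "Files/raw/parts.csv"
# is a hypothetical path used for illustration.
df = (
    spark.read
    .option("header", "true")
    .csv("Files/raw/parts.csv")
)

# Persist the data as a Delta table so other Fabric items can query it.
df.write.format("delta").mode("overwrite").saveAsTable("parts_catalog")
```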
