
Exploring the Modern Data Warehouse

The Modern Data Warehouse (MDW) is a common architectural pattern for building analytical data pipelines in a cloud-first environment. The MDW pattern is foundational for enabling advanced analytical workloads such as machine learning (ML) alongside traditional ones such as business intelligence (BI).

Learn about the traditional data warehouse vs the Modern Data Warehouse

The modern data warehouse unlocks advanced analytics capabilities that would otherwise be difficult to achieve with traditional data warehousing architectures. In a traditional data warehouse, data pipelines and the relevant dimensional model (star schema) are built based on known reporting requirements, so analytics requirements can be met using a top-down (deductive) approach. For advanced analytical requirements such as machine learning use cases, reporting outputs are unknown at the start. This requires an iterative exploratory analysis phase by data scientists, which helps uncover insights in raw datasets and their relevance to the outcome. Hence, implementation of advanced analytics can be described as a bottom-up (inductive) approach.

The following diagram shows the different types of analytics that can be done using both traditional and modern data warehouses:

(Diagram: the analytics spectrum, from descriptive analytics ("What happened?") and diagnostic analytics ("Why did it happen?") to predictive analytics ("What will happen?") and prescriptive analytics ("How can we make it happen?"). The traditional data warehouse covers the information-oriented, top-down (deductive) end; the modern data warehouse extends to the optimization-oriented, bottom-up (inductive) end. Difficulty and value both increase along the spectrum.)

Compared to a traditional RDBMS data warehouse, a data lake is the primary means of data storage in an MDW. Data lakes support storage of both structured and unstructured datasets, which is required for advanced analytics use cases. A data lake also enables schema-on-read access, which is crucial for exploratory analysis. The RDBMS data warehouse is still an important component of the MDW architecture, but it's now used as a serving layer to enable traditional business intelligence reporting.

The mechanism of loading data is also different. While extract-transform-load (ETL, with SSIS as an example) is preferred in traditional data warehousing, extract-load-transform (ELT) is preferred in an MDW. With ELT, data is first ingested into the data lake as-is and then transformed.
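
As a minimal sketch of the ELT (and schema-on-read) idea with PySpark; the storage account, container, paths, and the amount column are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract-Load: the source file has already been copied unchanged into
# the raw zone (for example, by an ADF copy activity).
raw_path = "abfss://datalake@myaccount.dfs.core.windows.net/raw/sales/2024/01/"

# Transform: schema-on-read, i.e. structure is applied when the data is
# read for transformation, not when it was loaded.
raw_df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Write the transformed output to the enriched zone as Parquet.
enriched_df = raw_df.dropDuplicates().filter(raw_df["amount"].isNotNull())
enriched_df.write.mode("overwrite").parquet(
    "abfss://datalake@myaccount.dfs.core.windows.net/enriched/sales/"
)
```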

Learn about the Modern Data Warehouse architecture

The following are the four stages in an MDW architecture:

  1. Ingest: Different data sources are ingested and persisted into the data lake.
  2. Transform: Data is validated and transformed into a pre-determined schema.
  3. Model: Data is then modeled into a form optimized for consumption (for example, Star schema).
  4. Serve: Data is exposed for consumption. This includes enabling visualization and analysis by end users.

The following functional components of the architecture enable these four stages:

  • Storage: Serves as the primary storage for the data lake and the relevant serving layers. For details, see the Understanding data lake section.
  • Compute: Includes compute for the ingest, transform, and serve stages. Common data computing frameworks include Apache Spark (available through Azure Synapse Spark pools or Azure Databricks), ADF data flows, and Azure Synapse dedicated SQL pools (particularly for the serving layer).
  • Orchestrator: Responsible for end-to-end orchestration of the data pipeline. Azure Data Factory and Azure Synapse data pipelines are common data orchestrators.
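
In ADF and Synapse, orchestration is expressed as declarative pipeline definitions rather than code, but the flow being orchestrated can be sketched in plain Python. The functions below are illustrative stand-ins for pipeline activities, not a real SDK:

```python
def ingest(source: str) -> str:
    """Land source data as-is in the raw zone; return the raw path."""
    ...

def transform(raw_path: str) -> str:
    """Validate and standardize data into the enriched zone."""
    ...

def model(enriched_path: str) -> str:
    """Shape enriched data into a consumption-optimized form."""
    ...

def serve(curated_path: str) -> None:
    """Expose curated data through the serving layer."""
    ...

def run_pipeline(source: str) -> None:
    # The orchestrator sequences the stages and fails the run if any
    # stage fails; retries, alerting, and scheduling sit around this.
    serve(model(transform(ingest(source))))
```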

The non-functional components that need to be considered are:

  • Security: Includes platform, application, and data security. For more information, see Data: Security.

Learn about the MDW logical architecture

The following diagram shows the logical MDW architecture along with its functional components:

(Diagram: data flows from Data Sources through the Ingest, Transform, Model, and Serve stages on top of a shared Storage layer, with Orchestrate spanning the whole pipeline. The Serve stage feeds BI + Reporting, Advanced Analytics, and Real Time Analytics.)

The following diagram shows the logical MDW architecture with corresponding Azure services. The list of Azure services is non-exhaustive.

(Diagram: the same logical architecture annotated with Azure services, including Azure Synapse, Azure Databricks, Azure Data Factory, Azure HDInsight, Azure Event Hubs, ADLS Gen2, and Power BI across the Ingest, Transform, Model, Serve, Storage, and Orchestrate components.)

Here are a few samples of implementing an MDW architecture:

(Diagram: a sample MDW implementation. A parking web service data source is copied into Azure Storage (Data Lake Gen2) LANDING, MALFORMED, and INTERIM zones; the data pipeline cleanses, standardizes, and transforms the data, loads it into a Synapse dedicated SQL pool data warehouse via PolyBase, and serves business users through Power BI, while data scientists and data engineers explore the lake directly. Key Vault manages secrets; App Insights and Log Analytics provide monitoring; the pipeline is orchestrated end to end.)

(Diagram: the same sample implemented entirely on Azure Synapse: cleanse/standardize and transform run on Synapse Spark pools, exploration uses Synapse SQL serverless, orchestration uses Synapse data pipelines, and PolyBase loads the Synapse dedicated SQL pool for serving.)

Understand the limitations

The MDW architectural principles are versatile in building analytical data pipelines in a cloud-first environment. However, they don't offer comprehensive guidance in the following areas:

  • Enterprise-wide data platform architecture.
  • Enterprise data governance.
  • Federated data architectures (Example: Data Mesh).
  • Data sharing.
  • On-premises data workloads.
  • Transactional data workloads.
  • Pure streaming and event-driven data pipeline architectures.

Learn Modern Data Warehouse best practices

The following sections elaborate on the specific stages and components within the MDW architecture, along with key considerations and best practices.

Understanding data lake

A primary component of the MDW architecture is the data lake, which acts as the source of truth for different datasets. It's recommended to use ADLS Gen2 for data lake storage.

It's a best practice to logically divide the data lake into multiple zones corresponding to increasing levels of data quality. Data lake zones typically map directly to the data ingestion, transformation, and serving outputs of your data pipeline activities. At least three zones (Bronze/Silver/Gold or Raw/Enriched/Curated) are recommended:

(Diagram: Bronze (raw, unprocessed) → Silver (cleansed, augmented) → Gold (optimized for consumption).)

Raw - Datasets are kept as similar to the source dataset as possible, with little to no transformation applied. Raw datasets provide the ability to replay data pipelines if there are production issues. This also means that data pipelines should be designed to be replayable and idempotent.

Enriched - Datasets have data validation applied and are standardized to a common type, format, and schema. It's common to use Parquet or Delta as the storage format for this layer.

Curated - Datasets have been optimized for consumption in the form of data products. It's common for these datasets to have dimensional modeling applied and to eventually be loaded into a data warehouse. This data lake zone forms part of the Serving layer in the MDW architecture. The common storage formats for this layer are Parquet and Delta.
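
To make the zone progression concrete, here is a minimal PySpark sketch of promoting one dataset through the three layers; the account, container, paths, and columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("zones-sketch").getOrCreate()
LAKE = "abfss://datalake@myaccount.dfs.core.windows.net"

# Raw: keep the data as close to the source as possible.
raw = spark.read.option("header", "true").csv(f"{LAKE}/raw/orders/2024/01/15/")

# Enriched: validate, standardize types, and store as Parquet.
enriched = (
    raw.withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
       .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
       .filter(F.col("order_id").isNotNull())
)
enriched.write.mode("overwrite").parquet(f"{LAKE}/enriched/orders/")

# Curated: aggregate into a consumption-ready data product.
curated = enriched.groupBy("order_date", "region").agg(
    F.sum("amount").alias("total_sales")
)
curated.write.mode("overwrite").parquet(f"{LAKE}/curated/daily_sales_by_region/")
```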

For detailed information, see CAF: Data Lake Zones and Containers.

Implementing data ingestion

First, identify which data sources need to be ingested. For each of the identified data sources, determine:

  • Location.
  • Source system (FTP, storage account, SAP, SQL Server, etc.).
  • Data format, if applicable (CSV, Parquet, Avro) and corresponding closest matching destination format.
  • Expected volume and velocity (frequency, interval).

Second, identify the ingestion mechanism, for example, batch or streaming. The ingestion mechanism determines the appropriate technologies to use. Azure Data Factory and Azure Synapse data pipelines are common services for ingesting batch datasets. Both provide an integration runtime (IR) for ingesting datasets across different networking environments (on-premises to cloud). Azure Event Hubs and Azure IoT Hub are common ingestion points for streaming, paired with a streaming service such as Azure Stream Analytics. For more information, see Data: Data Ingestion.
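
For the streaming path, one hedged sketch: Event Hubs exposes a Kafka-compatible endpoint that Spark Structured Streaming can consume (this assumes the spark-sql-kafka connector is on the classpath; the namespace, hub name, and lake paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-ingest").getOrCreate()

# Event Hubs exposes a Kafka-compatible endpoint on port 9093; Spark
# authenticates with SASL PLAIN using "$ConnectionString" as the user
# and the Event Hubs connection string as the password.
conn = "Endpoint=sb://mynamespace.servicebus.windows.net/;..."  # placeholder
jaas = (
    "org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{conn}";'
)

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "mynamespace.servicebus.windows.net:9093")
    .option("subscribe", "telemetry")  # the event hub name
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", jaas)
    .load()
)

# Land raw events in the lake; the checkpoint makes the stream
# restartable without duplicating data.
query = (
    stream.writeStream.format("parquet")
    .option("path", "abfss://datalake@myaccount.dfs.core.windows.net/raw/telemetry/")
    .option("checkpointLocation", "abfss://datalake@myaccount.dfs.core.windows.net/_checkpoints/telemetry/")
    .start()
)
```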

Generally, the ingestion pipeline requires a one-time historical load and a recurring delta load. For the latter, determine how to identify data deltas for each run. Two commonly used methods are change data capture and using a dataset attribute (such as a last-modified timestamp) to identify modified records. An external state store may be needed to keep track of the last loaded dataset, in the form of a watermark table.
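
A minimal sketch of a watermark-based delta load, assuming the source table exposes a modified_at column and the watermark is kept as a small dataset in the lake (all names and connection details are hypothetical, and the SQL Server JDBC driver is assumed to be available):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("delta-load").getOrCreate()
LAKE = "abfss://datalake@myaccount.dfs.core.windows.net"

# 1. Read the last high-water mark from the external state store
#    (kept here as a tiny single-row dataset in the lake).
watermark_path = f"{LAKE}/_state/orders_watermark/"
last_watermark = spark.read.parquet(watermark_path).first()["last_modified"]

# 2. Pull only records changed since the last run, using a source
#    attribute (modified_at is assumed to exist on the source table).
source = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://mysource.database.windows.net;database=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "...")  # fetch from Key Vault in practice
    .load()
)
delta = source.filter(F.col("modified_at") > F.lit(last_watermark))

# 3. Append the delta to the raw zone, then advance the watermark only
#    if new rows were actually found.
if delta.head(1):
    delta.write.mode("append").parquet(f"{LAKE}/raw/orders/")
    delta.select(F.max("modified_at").alias("last_modified")) \
        .write.mode("overwrite").parquet(watermark_path)
```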

Consider using a metadata-driven approach for large-scale ingestion use cases such as loading multiple datasets. For more information, see Data: Config-driven Data Pipelines and ADF Metadata-driven copy jobs.
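
The essence of a metadata-driven approach is that one generic, parameterized pipeline is driven by configuration instead of one hand-built pipeline per dataset. A minimal sketch with a hypothetical config follows:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("config-driven").getOrCreate()
LAKE = "abfss://datalake@myaccount.dfs.core.windows.net"

# Hypothetical config: one entry per dataset. In ADF, the equivalent is
# a control table driving a single parameterized pipeline.
config = json.loads("""
[
  {"name": "orders",    "format": "csv",     "source": "/landing/orders/",    "target": "/raw/orders/"},
  {"name": "customers", "format": "parquet", "source": "/landing/customers/", "target": "/raw/customers/"}
]
""")

# One generic loop ingests every configured dataset, so adding a new
# dataset means adding a config entry, not writing a new pipeline.
for dataset in config:
    reader = spark.read.format(dataset["format"])
    if dataset["format"] == "csv":
        reader = reader.option("header", "true")
    df = reader.load(LAKE + dataset["source"])
    df.write.mode("append").parquet(LAKE + dataset["target"])
```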

At times, data pre-processing is required after data ingestion and before data transformation. In certain cases where a large number of huge files (gigabytes or more each) are being ingested, pre-processing steps can get complicated. An industry-specific example is the processing of data in the ROS bag format, which requires extracting the bag files and generating metadata for each bag file to make it available for further processing.

For handling such scenarios, Azure Batch can be a great compute option. See Data: Pre-processing with Azure Batch for more details.

Performing data transformation

Following data ingestion into the Raw zone of the data lake, data needs to be validated and then transformed into a standard format and schema in the Enriched zone. It's a best practice to maintain a malformed record store to track records that fail validation, which helps with debugging issues.
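
A minimal PySpark sketch of this validate-and-quarantine step, with illustrative rules and hypothetical paths and columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("validate").getOrCreate()
LAKE = "abfss://datalake@myaccount.dfs.core.windows.net"

raw = spark.read.parquet(f"{LAKE}/raw/orders/")

# Illustrative, null-safe validation rules: a non-null key and a
# positive, non-null amount.
is_valid = (
    F.col("order_id").isNotNull()
    & F.col("amount").isNotNull()
    & (F.col("amount") > 0)
)

valid = raw.filter(is_valid)
malformed = raw.filter(~is_valid).withColumn(
    "failed_at", F.current_timestamp()  # when the record failed validation
)

# Valid records continue to the Enriched zone; failures land in the
# malformed record store for later inspection and replay.
valid.write.mode("overwrite").parquet(f"{LAKE}/enriched/orders/")
malformed.write.mode("append").parquet(f"{LAKE}/malformed/orders/")
```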

Common services used at this stage include Azure Synapse Spark pools, Azure Databricks, and data flows (ADF/Synapse).

See Data: Data processing and Data: CI/CD for data for more information.

Learn about data modeling

Data modeling goes hand in hand with data transformation. Here, data modeling refers to both the standardized data model in the Enriched zone and the consumer-optimized data model in the Curated zone. While related, the two data models serve fundamentally different purposes. The goal of the Enriched zone data model is to provide a common data model without a specific business use case in mind; the priority is completeness, data standardization, validity, and uniformity across disparate sources from the Raw zone. Datasets in the Curated zone, on the other hand, are designed to be easily consumable. Producing them may involve data filtering, aggregations, and use case-specific transformations, depending on the specific consumer of the dataset. The Curated zone may therefore contain many derivative data products produced from datasets in the Enriched zone.
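
A minimal PySpark sketch of the Enriched-to-Curated step, deriving a simple star schema (the dataset, columns, and paths are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("modeling").getOrCreate()
LAKE = "abfss://datalake@myaccount.dfs.core.windows.net"

orders = spark.read.parquet(f"{LAKE}/enriched/orders/")

# Dimension: one row per customer, with a surrogate key.
dim_customer = (
    orders.select("customer_id", "customer_name", "region")
          .dropDuplicates(["customer_id"])
          .withColumn("customer_key", F.monotonically_increasing_id())
)

# Fact: one row per order, referencing the dimension by surrogate key.
fact_sales = orders.join(
    dim_customer.select("customer_id", "customer_key"), "customer_id"
).select("customer_key", "order_id", "order_date", "amount")

dim_customer.write.mode("overwrite").parquet(f"{LAKE}/curated/dim_customer/")
fact_sales.write.mode("overwrite").parquet(f"{LAKE}/curated/fact_sales/")
```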

See Data: Data modeling for more information.

How to serve data

The serving layer of the architecture functions primarily to serve data to downstream consumers. The Curated zone in the data lake forms part of the serving layer, and other components such as a data warehouse, an API, or a dashboard are also commonly used. Depending on the number of consumers and their requirements, the same dataset may be exposed through multiple serving mechanisms.

Common services used at the serve stage include Azure Synapse dedicated SQL pools, Azure SQL, and Microsoft Power BI.
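
As a hedged sketch of one serving mechanism, curated data can be loaded into a relational serving store with a generic JDBC write (at scale, a Synapse-specific connector or a PolyBase/COPY load is usually preferred; the server, database, and credentials below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serve").getOrCreate()
LAKE = "abfss://datalake@myaccount.dfs.core.windows.net"

fact_sales = spark.read.parquet(f"{LAKE}/curated/fact_sales/")

# Load the curated dataset into the relational serving layer so BI
# tools can query it with familiar SQL semantics.
(
    fact_sales.write.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=servingdw")
    .option("dbtable", "dbo.fact_sales")
    .option("user", "etl_user")
    .option("password", "...")  # fetch from Key Vault in practice
    .mode("overwrite")
    .save()
)
```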

Explore MDW technical samples

The following are technical samples showcasing concrete implementations of the Modern Data Warehouse pattern:

For more information