
Understanding the data governance capability

Note

The Data Playbook defines a set of capabilities: conceptual building blocks used to construct data-related solutions. See Defining data capabilities for the full set of capabilities defined in the playbook.

Data governance is the collection of processes, policies, roles, metrics, and standards that ensure the effective and efficient use of information. It also helps establish data management processes that keep data secure, private, accurate, and usable throughout the data lifecycle.

Data governance can be implemented in the following operating models:

  • Centralized.
  • Federated.
  • Decentralized.

Access to trusted enterprise data is essential for breaking down data silos, democratizing data, enabling intelligent experiences, and powering digital transformations. To achieve this, your organization needs a strong data governance strategy to drive business growth, handle sensitive information, make informed decisions, and succeed in a competitive market.

A few critical principles to keep in mind while designing a data governance strategy:

  • Cross-organization data asset ownership and accountability.
  • Standardized rules and regulations.
  • Dedicated data stewards and data administrators.
  • High-quality, reliable data.
  • Transparency.

Upon successful implementation, data governance ensures that the data estate is audited, evaluated, documented, managed, protected, and trustworthy. However, implementing unified data governance across an organization is challenging because of factors such as:

  • Organization-wide acceptance and cross-domain collaboration.
  • Existence of siloed data with multiple sources of truth.
  • Balancing governance standards and flexibility.
  • Stakeholder alignment.
  • Access control.

A few recommendations for a successful data governance implementation include:

  • Think big but start small.
  • Appoint an executive sponsor.
  • Align stakeholders and build the case for data governance.
  • Develop the right metrics.
  • Keep communicating with all stakeholders.

Understanding data governance characteristics

A robust framework for enterprise-scale data governance includes important characteristics and functions that are defined below.

Defining data catalog, data discovery, and main features

The process of data cataloging and discovery includes maintaining a physical record of an organization's metadata and data assets in a unified and scalable manner. This then enables automatic identification, description, logical search, and discovery.

Metadata consists of both technical properties of the dataset and business metadata.

  • Technical properties include data schema and structure, physical location, type, format, approximate size, and lineage.
  • Business metadata includes related glossary terms, data owners, data stewards, data classifications, data sensitivity and more.
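As an illustration of how these two kinds of metadata fit together, the following is a minimal Python sketch of a single catalog entry. The field names are invented for this example and aren't the schema of any particular catalog product.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class TechnicalMetadata:
        schema: Dict[str, str]        # column name -> data type
        location: str                 # physical storage path or URI
        format: str                   # for example, "parquet" or "csv"
        approximate_size_gb: float
        upstream_sources: List[str]   # coarse-grained lineage

    @dataclass
    class BusinessMetadata:
        glossary_terms: List[str]
        owner: str
        steward: str
        classifications: List[str]    # for example, "PII", "Financial"
        sensitivity: str              # for example, "Confidential"

    @dataclass
    class CatalogEntry:
        name: str
        technical: TechnicalMetadata
        business: BusinessMetadata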

A data catalog is a central collection of enterprise metadata. It offers a unified view to manage the organization's data, which is crucial for meeting regulations like GDPR. Furthermore, it provides other insights into how data is being created and used across the data estate.

At a minimum, a data catalog should enable the following features:

  • Onboard dataset metadata: Ideally, metadata updates happen regularly, automatically, and at scale, generally through data source scanning, which captures all technical metadata associated with each dataset, including data lineage.
  • Curate and enrich: After the dataset metadata has been onboarded, the catalog should enable dataset owners and stewards to enrich the metadata with any business-specific information, including associated glossary terms. Ideally, it should also enable automatic classification of datasets at scale for effective data governance.
  • Enable data discoverability and evaluation: The data catalog should enable users to efficiently search and browse for datasets across technical metadata dimensions and business semantics. It should also allow users to quickly and easily evaluate whether a dataset is fit for the intended purpose, for example by offering dataset previews and showing lineage information, data sensitivity, related glossary terms, and more.

Data discovery provides the data intelligence an organization needs to develop new products and services, enable scalable data governance, protect data from risk exposure, democratize data use, and uncover new insights and opportunities for business value creation.
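To make the search and evaluation requirements concrete, here is a toy filter over entries shaped like the CatalogEntry sketch earlier. A production catalog would back this with an indexed search service rather than a linear scan.

    def search_catalog(entries, text=None, glossary_term=None, classification=None):
        """Filter catalog entries on technical and business dimensions."""
        results = []
        for entry in entries:
            if text and text.lower() not in entry.name.lower():
                continue
            if glossary_term and glossary_term not in entry.business.glossary_terms:
                continue
            if classification and classification not in entry.business.classifications:
                continue
            results.append(entry)
        return results

    # Example: find all customer-related datasets that contain PII.
    # hits = search_catalog(all_entries, text="customer", classification="PII")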

At a minimum, data discovery should enable the following features:

  • Machine learning (ML) based solutions: ML techniques can enhance data discovery by discovering and inferring relationships between datasets, which accelerates an organization's understanding of its data.
  • SQL based data discovery: The Azure SQL family of services, which includes Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, and SQL Server, has built-in basic capabilities for discovering, classifying, labeling, and reporting sensitive data. The goal is to protect the data itself, not just the database. It currently supports the following capabilities:

    • Discovery and recommendations: The classification engine scans your database and identifies columns containing potentially sensitive data. It then provides you with an easy way to review and apply the appropriate classification recommendations, and to manually classify columns.
    • Labeling: Sensitivity classification labels can be persistently tagged on columns.
    • Visibility: The database classification state can be viewed in a detailed report that can be printed or exported for compliance and auditing purposes.

    This approach isn't aligned with the idea of having a centralized discovery and cataloging mechanism. Rather, it's a specific and constrained data discovery option.
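As a concrete illustration of the labeling capability above, the following sketch applies a persistent sensitivity label to a column using the T-SQL ADD SENSITIVITY CLASSIFICATION statement, driven from Python with pyodbc. The connection string and the dbo.Customers.Email column are placeholders to adapt to your environment.

    import pyodbc

    # Placeholder connection string; supply your own server and database.
    conn_str = (
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:<your-server>.database.windows.net,1433;"
        "Database=<your-database>;"
        "Authentication=ActiveDirectoryInteractive;"
    )

    # Persistently tag a hypothetical dbo.Customers.Email column.
    label_sql = (
        "ADD SENSITIVITY CLASSIFICATION TO dbo.Customers.Email "
        "WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info', "
        "RANK = MEDIUM);"
    )

    conn = pyodbc.connect(conn_str)
    conn.execute(label_sql)
    conn.commit()
    conn.close()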

For more information on this topic, see the data discovery and classification documentation for these services.

Using data classification

Tagging data assets with appropriate information, privacy, or other sensitivity classifications to secure onward use and protection.
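As a minimal sketch of what automatic classification can look like, the following uses regular-expression heuristics to suggest tags from sampled column values. The patterns and the 80 percent match threshold are illustrative; production classifiers combine richer rule sets with dictionaries and ML models.

    import re

    # Illustrative patterns only; real classifiers use much richer rule sets.
    PATTERNS = {
        "Email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
        "US-SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
        "CreditCard": re.compile(r"^\d{13,19}$"),
    }

    def suggest_classifications(sample_values):
        """Return labels whose pattern matches most of the sampled values."""
        suggestions = set()
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in sample_values if pattern.match(str(v)))
            if sample_values and hits / len(sample_values) > 0.8:
                suggestions.add(label)
        return suggestions

    print(suggest_classifications(["alice@contoso.com", "bob@fabrikam.com"]))
    # {'Email'}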

Managing data ownership

Ensure data assets are owned by accountable and empowered agents within the organization who are responsible for their protection, description, access, and quality.

Managing data security

Ensure data is encrypted, obfuscated, tokenized, or has other appropriate security measures applied in line with its classification. Includes capturing evidence of security application and management of data loss prevention.
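As one hedged example of the tokenization option, the sketch below derives a stable, non-reversible token from a sensitive value with a keyed hash. A real design hinges on key management, for example storing the key in a key vault, and on whether reversibility is required.

    import hashlib
    import hmac

    def tokenize(value: str, key: bytes) -> str:
        """Replace a sensitive value with a stable, non-reversible token."""
        return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

    secret = b"demo-key-from-a-key-vault"  # placeholder; never hard-code keys
    print(tokenize("alice@contoso.com", secret)[:16])  # same input -> same token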

Managing data sovereignty and cross-border data sharing

Ensure data is being stored, accessed, and processed according to jurisdictional rules and prohibitions. This also requires establishing data sharing and collaboration policies and standards that reflect data quality, security, privacy, and compliance requirements.
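A toy sketch of one such policy check follows, with an assumed mapping from data classifications to permitted storage regions. Actual jurisdictional rules are far more nuanced and need legal review.

    # Assumed policy: which regions may store data of each classification.
    RESIDENCY_POLICY = {
        "EU-PersonalData": {"westeurope", "northeurope"},
        "Public": {"*"},
    }

    def storage_allowed(classification: str, region: str) -> bool:
        """Check whether a region may store data with this classification."""
        allowed = RESIDENCY_POLICY.get(classification, set())
        return "*" in allowed or region in allowed

    print(storage_allowed("EU-PersonalData", "eastus"))  # False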

Managing data quality

Ensure data is fit for its purpose according to the core measures of data quality—accuracy, completeness, consistency, validity, relevance, and timeliness.
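The sketch below scores two of these measures, completeness and validity, over a list of records. The rules and sample data are invented for illustration.

    import re

    def completeness(rows, column):
        """Fraction of rows where the column is present and non-empty."""
        filled = sum(1 for r in rows if r.get(column) not in (None, ""))
        return filled / len(rows) if rows else 0.0

    def validity(rows, column, pattern):
        """Fraction of non-empty values matching the expected format."""
        values = [r[column] for r in rows if r.get(column)]
        ok = sum(1 for v in values if re.match(pattern, str(v)))
        return ok / len(values) if values else 0.0

    rows = [{"email": "a@b.com"}, {"email": ""}, {"email": "not-an-email"}]
    print(completeness(rows, "email"))                         # ~0.67
    print(validity(rows, "email", r"^[\w.+-]+@[\w-]+\.\w+$"))  # 0.5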

Managing the data lifecycle

Ensure data is sourced, stored, processed, accessed, and disposed of in line with its legal, regulatory, and privacy lifecycle requirements, which are often defined in a retention schedule.
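A minimal sketch of checking records against a retention schedule follows; the retention periods are made up for illustration.

    from datetime import datetime, timedelta, timezone

    # Assumed retention schedule: record type -> retention period.
    RETENTION = {
        "invoice": timedelta(days=7 * 365),  # for example, seven years
        "web_log": timedelta(days=90),
    }

    def is_due_for_disposal(record_type: str, created_at: datetime) -> bool:
        """Flag records whose retention period has elapsed."""
        period = RETENTION.get(record_type)
        if period is None:
            return False  # unknown types need a steward's decision, not deletion
        return datetime.now(timezone.utc) - created_at > period

    old_log = datetime(2020, 1, 1, tzinfo=timezone.utc)
    print(is_due_for_disposal("web_log", old_log))  # True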

Managing data entitlements and access tracking

Ensure data is only accessible to authorized people and processes. Auditing this access is an important part of evidencing and ensuring control.

  • Data entitlement is a concept that enhances security, compliance, and data governance by providing fine-grained control over who accesses what data and under which conditions. This context might include attributes like time, location, device, and more.
  • Access control consists of two main components: authentication and authorization.
  • Access tracking consists of recording who is accessing which systems and data.
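Tying these ideas together, here is a toy attribute-based authorization check that writes an audit record for every decision. The user, dataset, and context attributes are invented for illustration.

    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    audit = logging.getLogger("access-audit")

    def authorize(user, dataset, context):
        """Grant access only if role and context satisfy the dataset's policy."""
        allowed = (
            user["role"] in dataset["allowed_roles"]
            and context.get("network") == "corporate"  # example context attribute
        )
        audit.info("user=%s dataset=%s granted=%s at=%s",
                   user["name"], dataset["name"], allowed,
                   datetime.now(timezone.utc).isoformat())
        return allowed

    user = {"name": "alice", "role": "analyst"}
    dataset = {"name": "sales", "allowed_roles": {"analyst", "steward"}}
    print(authorize(user, dataset, {"network": "corporate"}))  # True, and audited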

Defining data lineage

Provides full visibility of data through its lifecycle, identifying its source, processing, movement, and usage. This ensures the use of trusted data sources and well-documented transformations, making it easier to verify and track data accurately. Data lineage is useful for troubleshooting, root cause analysis, data quality, compliance, and impact analysis.
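A small sketch of lineage as a directed graph follows; traversing it downstream answers the impact-analysis question of what is affected when a source changes. The dataset names are illustrative.

    # Edges point from a source dataset to the datasets derived from it.
    LINEAGE = {
        "raw_orders": ["clean_orders"],
        "clean_orders": ["sales_report", "ml_features"],
    }

    def downstream(dataset, graph=LINEAGE):
        """All datasets directly or transitively derived from `dataset`."""
        impacted, stack = set(), [dataset]
        while stack:
            for child in graph.get(stack.pop(), []):
                if child not in impacted:
                    impacted.add(child)
                    stack.append(child)
        return impacted

    print(downstream("raw_orders"))
    # {'clean_orders', 'sales_report', 'ml_features'}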

Incorporating a framework for data privacy

Defines a framework for protecting the privacy of data products that reflects the regulatory and privacy laws governing your organization. It also ensures that processes and technologies are in place so that the privacy framework is actively applied.

Identifying and maintaining trusted sources and data contracts

Large organizations might have similar data originating from or processed through many sources. Identifying and managing trusted sources and defining consumption data contracts is important to ensure data is being sourced from an agreed source of truth and the overall data architecture is being managed effectively.
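Data contracts are often expressed as machine-checkable schemas. The following sketch validates a record against a hand-rolled contract; the source name and fields are invented, and in practice teams typically rely on schema registries or standards such as JSON Schema.

    # An illustrative consumption contract for a trusted customer source.
    CONTRACT = {
        "source": "crm.customers_v2",  # the agreed source of truth
        "fields": {"customer_id": int, "email": str, "created_at": str},
        "required": {"customer_id", "email"},
    }

    def conforms(record: dict, contract: dict) -> bool:
        """Check required fields are present and all fields have expected types."""
        if not contract["required"].issubset(record):
            return False
        return all(isinstance(record[f], t)
                   for f, t in contract["fields"].items() if f in record)

    print(conforms({"customer_id": 42, "email": "a@b.com"}, CONTRACT))  # True
    print(conforms({"email": "a@b.com"}, CONTRACT))                     # False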

Guiding ethical use and purpose

Ethical use of data is increasingly being scrutinized beyond privacy laws and data subject rights. As the use of AI and machine learning grows, it's important to establish ethical principles based on transparency, accountability, and fairness that guide the use of data, and to ensure that data is processed in a way that customers would expect, in line with the organization's code of ethics.

Defining and using master data management

Ensures there's a single consistent view of the organization's master data, which is fundamental to accurate and reliable data usage. Master data is the most commonly used data within an organization and describes the core operational aspects of a company, for example, products, customers, employees, and company structure. Hence it's important to have a reliable single source of truth for this data.
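A toy sketch of consolidating duplicate records into a golden record follows, using a simple survivorship rule in which the most recently updated non-empty value wins. Real master data management tools apply configurable match-and-merge rules.

    def golden_record(duplicates):
        """Merge duplicates: the latest non-empty value per field survives."""
        merged = {}
        for rec in sorted(duplicates, key=lambda r: r["updated_at"]):
            for field, value in rec.items():
                if value not in (None, ""):
                    merged[field] = value  # later records overwrite earlier ones
        return merged

    dupes = [
        {"id": 1, "email": "a@old.com", "phone": "", "updated_at": "2023-01-01"},
        {"id": 1, "email": "a@new.com", "phone": None, "updated_at": "2024-06-01"},
    ]
    print(golden_record(dupes)["email"])  # a@new.com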

Understanding data version control

Provides the ability to get a clear picture of the data at any given time by capturing and saving different versions of datasets along with the metadata and transformations that go with them. This ensures that the same dataset used at any given point in time is still available and can be used in the future, even if the dataset has been changed or updated.
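A minimal sketch of content-addressed versioning follows: each snapshot is identified by a hash of its contents together with the metadata and transformations that produced it. The layout is invented for illustration.

    import hashlib
    import json
    from datetime import datetime, timezone

    def snapshot(rows, transformations):
        """Record an immutable, content-addressed version of a dataset."""
        payload = json.dumps(rows, sort_keys=True).encode("utf-8")
        return {
            "version_id": hashlib.sha256(payload).hexdigest()[:12],
            "created_at": datetime.now(timezone.utc).isoformat(),
            "row_count": len(rows),
            "transformations": transformations,  # how this version was produced
        }

    v1 = snapshot([{"id": 1, "x": 2.0}], ["drop_nulls", "scale_x"])
    print(v1["version_id"])  # stable for identical contents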

See Data Versioning in ML for details.

Learn more about data governance in Microsoft Fabric

See Microsoft Fabric governance and compliance for a list of capabilities in Microsoft Fabric for data governance.

Examples

Microsoft Purview is Microsoft’s flagship data governance tool. It's recommended for use as the main data catalog, particularly for data estates deployed on Azure. A company should typically try to have only one instance of Purview in production, as described in Microsoft Purview accounts architectures and best practices.

See Microsoft Purview lineage user guide for more details on data lineage in Microsoft Purview.

The following samples focus on Microsoft Purview:

  • MDW Repo: Data Governance: This end-to-end sample showcases how to incorporate data governance in a modern data warehouse architecture. It uses Microsoft Purview and Presidio while showcasing associated DevOps processes to operationalize the solution.
  • Publish and Subscribe Purview events: Integration of Microsoft Purview and third-party services using a Kafka Topic. It includes sample scripts, which demonstrate available publish and subscribe operations.
  • The RDF Import Solution Accelerator: An example implementation that imports an ontology and its corresponding individuals into Purview using the Purview REST API.

From a governance perspective, Microsoft Purview also includes an application called Data Estate Insights, which is purpose-built for governance stakeholders such as the Chief Data Officer. It provides actionable insights into the organization's data estate, catalog usage, adoption, and processes. For more information, see Microsoft Purview Data Estate Insights application.
