Understanding the data protection and security capability¶

Note

The Data Playbook defines a set of capabilities that represent the conceptual building blocks used to build data-related solutions. See Defining data capabilities for the full set of capabilities defined in the playbook.

Data democratization across enterprises, along with the creation of new laws and regulations for data protection, is resulting in an unprecedented focus on data privacy requirements. Having a common understanding of data privacy is important. Wikipedia describes data privacy as "the relationship between the collection and dissemination of data, technology, the public expectation of privacy, and the legal and political issues surrounding them."

Data protection and security is the capability that protects and secures data according to its data privacy classification.

Understanding data protection and security characteristics¶

To protect and secure data, the data needs to be classified by the following characteristics:

Sensitivity: public, internal, confidential, restricted.
Usage: internal, external, within region, cross-regions.
Movement pattern: protect at rest, protect in transit.
Reliability: data is recoverable upon failures.
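
The characteristics above can be captured as a small data structure that travels with each dataset. The following is a minimal sketch, not a prescribed schema: the class names, the `required_controls` function, and the rule that confidential or restricted data must be de-identified are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


class Usage(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"
    WITHIN_REGION = "within region"
    CROSS_REGIONS = "cross-regions"


@dataclass(frozen=True)
class DataClassification:
    sensitivity: Sensitivity
    usage: Usage
    protect_at_rest: bool      # movement pattern
    protect_in_transit: bool   # movement pattern
    recoverable: bool          # reliability


def required_controls(c: DataClassification) -> list:
    """Hypothetical policy: map a classification to a list of controls."""
    controls = []
    if c.protect_at_rest:
        controls.append("encrypt at rest")
    if c.protect_in_transit:
        controls.append("encrypt in transit")
    if c.recoverable:
        controls.append("backup and restore")
    # Assumed rule: confidential or restricted data is de-identified.
    if c.sensitivity.value >= Sensitivity.CONFIDENTIAL.value:
        controls.append("pseudonymize or anonymize")
    return controls


customer_table = DataClassification(
    sensitivity=Sensitivity.RESTRICTED,
    usage=Usage.WITHIN_REGION,
    protect_at_rest=True,
    protect_in_transit=True,
    recoverable=True,
)
print(required_controls(customer_table))
```

Keeping the classification explicit makes it possible to derive protection controls from metadata instead of deciding them case by case.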

For highly sensitive data or personally identifiable data, obfuscation or anonymization techniques can be applied. Many processes can be implemented to ensure data privacy, and they fall mainly into two categories: obfuscation (or pseudonymization) and anonymization.

How to use obfuscation or pseudonymization techniques to protect data¶

Data tokenization substitutes sensitive data with a meaningless value (token). The token itself can't be reversed; only the system that created it can map the token back to the original data. The token has no meaning outside the system that creates it and links it to other data. This technique is used to protect sensitive data at rest.
Data encryption translates personal data into another form or code so that data categorized as sensitive is replaced with data in an unreadable format. Authorized users have access to a secret key that allows them to retrieve the original data.
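
As an illustration of tokenization, the sketch below keeps a mapping between tokens and original values in memory. The `TokenVault` class and the `tok_` prefix are hypothetical; a real system would keep the mapping in a secured, persistent store with access control. Encryption is not shown here because it depends on the cryptography library and key management in use.

```python
import secrets


class TokenVault:
    """In-memory token vault (hypothetical), for illustration only."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only a caller with access to the vault can recover the original value.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("4111 1111 1111 1111")
print(token)                    # a random token such as tok_<16 hex characters>
print(vault.detokenize(token))  # the original value, recovered through the vault
```

Downstream systems can store and join on the token, while the original value stays inside the vault.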

How to use anonymization techniques to protect data¶

Data masking is a technique that can be seen as permanent tokenization and is used to protect data in use (not at rest). Once the data is randomized using a masking process, it cannot be reversed back to its original state. Masking techniques include:
- Data scrambling is an anonymization technique that involves mixing or obfuscating characters. Important: such a process can sometimes be reversed, so it is not recommended for anonymizing personally identifiable information. Where reverting the value is allowed, it can be used as a pseudonymization technique instead.
- Data blurring is an anonymization technique that uses an approximation of data values instead of the original identifiers, making it difficult to re-identify individuals.
- Bucketing or generalization is an anonymization technique that replaces individual values of fields with a broader category.
- Nulling out is a technique that replaces sensitive data in the dataset by null values.
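
Three of the masking techniques above (blurring, bucketing, and nulling out) can be sketched with plain Python. The function names, the sample record, the noise range, and the bucket width are assumptions chosen for illustration; appropriate values depend on the dataset and its re-identification risk.

```python
import random


def blur(value: float, spread: float, rng: random.Random) -> float:
    """Data blurring: return an approximation within +/- spread of the value."""
    return value + rng.uniform(-spread, spread)


def bucket_age(age: int, width: int = 10) -> str:
    """Bucketing: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def null_out(record: dict, fields: set) -> dict:
    """Nulling out: replace the named fields with None."""
    return {k: (None if k in fields else v) for k, v in record.items()}


record = {"name": "Jane Doe", "age": 36, "salary": 52000.0}
rng = random.Random(0)  # seeded only to make the sketch repeatable

masked = null_out(record, {"name"})
masked["age"] = bucket_age(record["age"])
# Blur the salary, then round to the nearest thousand to hide the exact noise.
masked["salary"] = round(blur(record["salary"], 1000.0, rng), -3)
print(masked)
```

None of these functions keeps a mapping back to the original values, which is what distinguishes them from the tokenization approach.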

Customized anonymization uses a customized solution including one or more anonymization techniques wrapped into custom code.

Implementations¶

When it comes to the technical implementation, pseudonymization techniques are different from anonymization techniques. Pseudonymization does not remove all identifying information from the data but reduces the ability to link that data with the individual identity. With anonymization, all the information that identifies an individual is scrubbed.
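
The difference can be sketched as follows. This is an illustration under stated assumptions, not a recommended implementation: the keyed hash (HMAC-SHA256) is one possible pseudonymization choice that this playbook does not prescribe, and `SECRET_KEY` and `IDENTIFYING_FIELDS` are hypothetical. Dropping direct identifiers is also only a first step toward anonymization, because the remaining fields may still allow re-identification when combined.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; keep real keys in a secret store
IDENTIFYING_FIELDS = {"name", "email"}         # assumed set of direct identifiers


def pseudonymize(record: dict) -> dict:
    """Replace direct identifiers with keyed hashes; records stay linkable to each other."""
    out = dict(record)
    for field in IDENTIFYING_FIELDS:
        if field in record:
            digest = hmac.new(SECRET_KEY, str(record[field]).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]
    return out


def anonymize(record: dict) -> dict:
    """Drop direct identifiers entirely so no field points back to the individual."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


rec = {"name": "Jane Doe", "email": "jane@example.com", "country": "NL"}
print(pseudonymize(rec))  # identifiers replaced by stable pseudonyms
print(anonymize(rec))     # identifiers removed
```

With the pseudonymized output, two records for the same person still share the same pseudonym, so whoever holds the key can re-link them. The anonymized output keeps no such link.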

There are assets available that might help with achieving some of these goals for certain technologies and use cases. For anonymization purposes, Presidio is a popular framework, which is considered to fall under the "customized anonymization technique" category. Here are some of its use cases and usage examples:

Anonymize Personal Identifiable Information using Presidio on Spark

Run Presidio on structured/semi-structured data

Anonymize Personal Identifiable Information entities in an Azure Data Factory ETL Pipeline

Learn more about data protection and security in Microsoft Fabric¶

Microsoft Fabric addresses data protection and security through several key features:

Conditional Access: It secures your apps using Microsoft Entra ID.
Resiliency: It provides reliability and regional resiliency with Azure availability zones.
Lockbox: It controls how Microsoft engineers access your data.
Service tags: Enable an Azure SQL Managed Instance (MI) to allow incoming connections from Microsoft Fabric.
OneLake security: This feature helps secure your data in OneLake.
Warehouse access model: Microsoft Fabric permissions and granular SQL permissions work together to govern Warehouse access and the user permissions once connected. Examples of granular permissions are object-level security, column-level security, row-level security, and dynamic data masking.
Information protection: Ability to discover, classify, and protect Fabric data using sensitivity labels from Microsoft Purview Information Protection.