Glossary

Active Learning

Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs.

See Active Learning (Wikipedia).
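
As a rough illustration (a sketch only, not tied to any specific framework beyond scikit-learn), the loop below performs least-confidence sampling; the `ask_oracle` function is a hypothetical stand-in for the human labeler, and the pool data is made up.

```python
# Minimal uncertainty-sampling sketch: train on a small labeled seed set, then
# repeatedly query the "oracle" for the example the model is least sure about.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ask_oracle(x):
    """Hypothetical stand-in for a human annotator."""
    return int(x.sum() > 0)

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 5))          # unlabeled pool
labeled_idx = list(range(10))               # small labeled seed set
y = {i: ask_oracle(X_pool[i]) for i in labeled_idx}

model = LogisticRegression()
for _ in range(20):                         # query budget
    model.fit(X_pool[labeled_idx], [y[i] for i in labeled_idx])
    proba = model.predict_proba(X_pool)
    uncertainty = 1 - proba.max(axis=1)     # least-confident sampling
    uncertainty[labeled_idx] = -1           # never re-query labeled points
    query = int(uncertainty.argmax())
    y[query] = ask_oracle(X_pool[query])    # interactively query the labeler
    labeled_idx.append(query)
```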

AIOps

AIOps (artificial intelligence for IT operations) is the application of artificial intelligence (AI) to enhance IT operations. Specifically, AIOps uses big data, analytics, and machine learning capabilities to collect and aggregate the huge and ever-increasing volumes of operations data generated by multiple IT infrastructure components, applications, and performance-monitoring tools; intelligently sift ‘signals’ out of the ‘noise’ to identify significant events and patterns related to system performance and availability issues; and diagnose root causes and report them to IT for rapid response and remediation, or, in some cases, automatically resolve these issues without human intervention.

See What is AIOps? (IBM).

Anomaly Detection

Anomaly detection is a process in machine learning that identifies data points, events, and observations that deviate from a data set’s normal behavior. Detecting anomalies in time series data is a pain point that is critical to address for industrial applications.

See Anomaly Detection (IBM).
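
A minimal sketch of the idea using scikit-learn's IsolationForest on a made-up series; real industrial time-series detection typically also accounts for seasonality and trend, which this example ignores.

```python
# Flag points that deviate from the normal behavior of a synthetic series.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
values = rng.normal(loc=10.0, scale=1.0, size=500)
values[100] = 25.0      # inject obvious anomalies
values[400] = -5.0

model = IsolationForest(contamination=0.01, random_state=42)
flags = model.fit_predict(values.reshape(-1, 1))   # -1 = anomaly, 1 = normal
print(np.where(flags == -1)[0])
```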

AutoML

Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time-consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity all while sustaining model quality.

See AutoML.

Azure Blob Storage

Azure Blob Storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.

See Azure Blob Storage.

Azure Data Lake Storage (ADLS) Gen2

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage. It provides file system semantics, file-level security, Hadoop compatibility, scale and more. Because these capabilities are built on Blob storage, it already includes low-cost, tiered storage, with high availability/disaster recovery capabilities. Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure.

Bounding Boxes

A bounding box is an imaginary rectangle that serves as a point of reference for object detection and creates a collision box for that object in image-processing projects.

See Bounding Boxes with Azure Read API.
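
For illustration, a box can be represented as (x_min, y_min, x_max, y_max) pixel coordinates; the sketch below computes intersection over union (IoU), a common way to compare a predicted box against a ground-truth box. The coordinates are made up.

```python
# IoU: area of overlap divided by area of union of two boxes.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes do not intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14
```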

Citizen Developer

A citizen developer is an employee who creates application capabilities for consumption by themselves or others, using tools that are not actively forbidden by IT or business units. A citizen developer is a persona, not a title or targeted role. They report to a business unit or function other than IT.

See Citizen Developer (Gartner).

Collaborative Application

Collaborative apps integrate data from multiple applications into a single viewpoint as a multi-player experience that streamlines communication and makes business processes easy to track through a single pane.

See Collaborative Apps for Hybrid Work.

Container

A runtime instance of an OCI image; the configuration, execution environment and lifecycle thereof. Defined by the OCI Runtime Specification.

See OCI Runtime Specification (Open Container Initiative).

Container Image Repository

A set of OCI images and artifacts within a registry with the same name. Both images and artifacts can be tagged to denote different versions in a human-readable format. alias(es): repository

See Container Registry Concepts.

Container Registry

Hosted storage for OCI images and OCI artifacts. Usually compliant with the OCI Distribution Specification ensuring a uniform API protocol regardless of platform. alias(es): OCI-compliant registry, registry

See OCI Distribution Specification (Open Container Initiative).

Continuous Delivery

Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles.

See What is Continuous Delivery?.

Continuous Delivery vs. Continuous Deployment

Along with continuous integration, continuous delivery and continuous deployment are practices that automate phases of software delivery. These practices enable development teams to release new features, enhancements and fixes to their customers with greater speed, accuracy and productivity. Continuous delivery and continuous deployment have a lot in common. To understand the differences between these practices—and find out which one you want to implement—we need to identify the phases of software delivery we can automate.

See Continuous delivery vs. continuous deployment.

Continuous Integration

Continuous Integration (CI) is the process of automating the build and testing of code every time a team member commits changes to version control. CI encourages developers to share their code and unit tests by merging their changes into a shared version control repository after every small task completion. Committing code triggers an automated build system to grab the latest code from the shared repository and to build, test, and validate the full main, or trunk, branch.

See Continuous Integration.

Continuous Training

Continuous Training relates to supporting the automatic and continuous retraining of a Machine Learning model in production to enable that model to adapt to real-time changes in the data, or to continuously learn from a stream of data.

CWEP

Code-With-Engineering-Playbook.

DAST

Dynamic Application Security Testing tools, when integrated into the continuous integration / continuous delivery pipeline, will help quickly uncover security issues only apparent when all components are integrated and running.

See Microsoft Security Engineering - Tools and Automation

Data Annotation

Data annotation is the process of analyzing raw data and adding metadata to provide context about each record. However, as opposed to Data Labeling, annotations are not simple unidimensional variables, but can be complex objects or lists of objects.

Data Anonymization

Data anonymization is the process by which data is altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party.

See Data Anonymization (Data Protection Commission).

Data Curation

Data curation is the process of identifying and prioritizing the most useful data for training a model.

See Data Curation.

Data Drift

Data drift is the change in data over time that causes a deviation between the current distribution and the distribution of the data that the underlying model was trained, tested, and validated on. This drift can result in models no longer making accurate predictions on real-world data.
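
One simple way to check a numeric feature for drift is a two-sample statistical test; the sketch below uses SciPy's Kolmogorov-Smirnov test on made-up training and production samples, with an arbitrary significance threshold.

```python
# Compare the training distribution of a feature with recent production data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)   # shifted distribution

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"Possible data drift detected (KS statistic={stat:.3f})")
```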

Data Enrichment

Data enrichment is a general term that refers to processes used to enhance, refine or otherwise improve raw data. Within the context of MLOps, we will refer to data enrichment as the process of enriching data using ML models and techniques.

See Data Enrichment.

Data Governance

Data governance is a set of capabilities that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented to support business objectives. It encompasses the people, processes, and technologies required to discover, manage, and protect data assets.

Data Ingestion

Data ingestion is the process of importing and transferring data from a source into a data storage system.

See Data Ingestion.

Data Integration

Data integration is the process of consolidating related data and unifying different data formats and data types from a wide range of data sources into a unique, accurate and comprehensive view of the data.

Data Labeling

Data labeling is the process of adding metadata and information to existing data. This helps in enriching existing data and is useful for downstream processes to act on. It is not required for all engagements and projects, but it is highly useful for engagements dealing with unstructured blob data, e.g., images, videos, and documents.

See Data Labeling.

Data Lake

A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are optimized for scaling to terabytes and petabytes of data. The data typically comes from multiple heterogeneous sources, and may be structured, semi-structured, or unstructured. The idea with a data lake is to store everything in its original, untransformed state. This approach differs from a traditional data warehouse, which transforms and processes the data at the time of ingestion.

Data Mesh

Data Mesh is a relatively new architectural pattern for implementing enterprise data platforms in large, complex organizations. It is a socio-technical approach to build a decentralized data architecture by leveraging a domain-oriented, self-serve design where domain data is treated as a product. Data Mesh mainly focuses on the business value of the data itself, leaving the Data Lake and the pipelines as secondary concerns.

Data Obfuscation or Pseudonymization

Data obfuscation, or pseudonymization, means replacing any information which could be used to identify an individual with a pseudonym, or, in other words, a value which does not allow the individual to be directly identified. It can still allow for some form of re-identification of the data.

See Pseudonymization (Data Protection Commission).
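
A minimal sketch of one common approach, keyed hashing of direct identifiers; the key and record below are illustrative only, and anyone holding the key could still re-identify records, as noted above.

```python
# Replace a direct identifier with a keyed-hash pseudonym.
import hashlib
import hmac

SECRET_KEY = b"example-key-kept-in-a-vault"   # assumption: stored securely elsewhere

def pseudonymize(identifier: str) -> str:
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane.doe@example.com", "purchase_total": 42.50}
record["email"] = pseudonymize(record["email"])   # email no longer directly identifying
print(record)
```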

Data Orchestration

Data Orchestration is the process that ensures tasks in data pipelines are executed in the correct order. A data orchestrator coordinates and manages dependencies between these tasks.

Data Partitioning

Data partitioning is a technique that can improve scalability, reduce contention, and optimize performance. It can also provide a mechanism for dividing data by usage pattern. In many large-scale solutions, data is divided into partitions that can be managed and accessed separately.

See Data partitioning guidance.
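
A toy sketch of hash partitioning, one common scheme: each record is routed to a partition based on a partition key so that partitions can be stored and accessed separately. The record layout and partition count are made up.

```python
# Route records to one of N partitions based on a hashed key.
import hashlib
from collections import defaultdict

def partition_for(key: str, num_partitions: int = 4) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

records = [
    {"customer_id": "C-001", "amount": 10.0},
    {"customer_id": "C-002", "amount": 25.0},
    {"customer_id": "C-003", "amount": 7.5},
]

partitions = defaultdict(list)
for record in records:
    partitions[partition_for(record["customer_id"])].append(record)

print({p: len(rows) for p, rows in partitions.items()})
```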

Data Pipelines

A data pipeline is the end-to-end process that moves data from a source to a destination through a sequence of tasks that implement data ingestion, transformation, integration, or enrichment, typically for business analytical purposes.
In an enterprise setup, it is recommended that this process is secured, governed, and automated. There are two types of data pipelines:
1. Batch data pipelines are designed to process high volumes of data from a wide range of data sources during a specific scheduled time window.
2. Streaming data pipelines are designed to ingest data in near real time from diverse streaming sources (e.g., sensors, IoT devices, etc.).

See Data pipelines in Data.

Data Privacy

Data privacy, sometimes also referred to as Information privacy, is the relationship between the collection and dissemination of data, technology, the public expectation of privacy, and the legal and political issues surrounding them. It deals with the concerns of protecting an individual's privacy preferences and personally identifiable information while storing, processing and sharing data.

See Information Privacy (Wikipedia).

Data Science Toolkit

The data science toolkit is an open-source collection of proven ML and AI implementation accelerators. Accelerators enable the automation of commonly repeated development processes to allow data science practitioners to focus on delivering complex business value and spend less time on basic setup.

See Data Science Toolkit.

Data Quality

Data quality is the degree to which your data is accurate, complete, timely, and consistent with your organization's requirements. You need to constantly monitor your data sets for quality to ensure that the data applications they power remain reliable and trustworthy.

See Data Quality in Data.

Data Security

Data security refers to the controls, standard policies and procedures implemented by an organization in order to protect its data from data breaches and attacks and to prevent data loss through unauthorized access.

Data Skipping

Data skipping is an optimization in which information (minimum and maximum values for each column) is collected automatically when you write data into a Delta Lake table. Delta Lake takes advantage of this information at query time to provide faster queries.

See Data Skipping (Delta Lake).
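
The pure-Python sketch below only simulates the idea (Delta Lake collects and applies these statistics automatically): per-file minimum and maximum values let a query skip files whose range cannot contain matching rows. The file names and statistics are made up.

```python
# Prune files whose [min, max] range cannot satisfy the query filter.
file_stats = [
    {"file": "part-000.parquet", "min_ts": 0,   "max_ts": 99},
    {"file": "part-001.parquet", "min_ts": 100, "max_ts": 199},
    {"file": "part-002.parquet", "min_ts": 200, "max_ts": 299},
]

def files_to_scan(stats, ts_low, ts_high):
    """Return only the files whose range overlaps the requested interval."""
    return [s["file"] for s in stats if s["max_ts"] >= ts_low and s["min_ts"] <= ts_high]

print(files_to_scan(file_stats, 150, 160))   # -> ['part-001.parquet']
```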

Data Versioning and Lineage

Data Versioning is the practice of governing and organizing training datasets in order to ensure the reproducibility of Machine Learning experiments. Data lineage is the process of tracking the flow of data over time, from its provenance, through the transformations applied, to its final output and consumption within a Data Pipeline.

Data Virtualization

Data virtualization is an approach to data management that provides a single/unified representation of the data, and allows for data retrieval without requiring technical details about the data such as how the data is formatted at the source or where it is physically stored. Being a single point of access to all enterprise data siloed across disparate systems, Data Virtualization allows for centralized security and governance.

See Data virtualization (Wikipedia).

Database Normalization

Database normalization is the process of structuring a relational database in accordance with a series of formal rules called normal forms in order to reduce data redundancy and improve data integrity. Informally, a relational database relation is often described as "normalized" if it meets third normal form.

See Database Normalization (Wikipedia).

DataOps

DataOps is a lifecycle approach to data analytics. It uses agile practices to orchestrate tools, code, and infrastructure to quickly deliver high-quality data with improved security. When you implement and streamline DataOps processes, your business can more easily and cost effectively deliver analytical insights. This allows you to adopt advanced data techniques that can uncover insights and new opportunities.

DevOps

The union of people, process, and technology to enable continuous delivery of value to customers. The practice of DevOps brings development and operations teams together to speed software delivery and make products more secure and reliable.

See What is DevOps?.

Distributed Training

In distributed training the workload to train a model is split up and shared among multiple mini processors, called worker nodes. These worker nodes work in parallel to speed up model training. Distributed training can be used for traditional ML models, but is better suited for compute and time intensive tasks, like deep learning for training deep neural networks.

See Distributed training with Azure Machine Learning.

EFR

Engineering for Reuse.

ETL

Extract, Transform, Load is the process by which data is extracted from different sources, transformed into a usable resource, and loaded into systems that may be accessed and consumed downstream to solve business problems or build data products.

See ETL in Azure.
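
A toy sketch of the three stages in plain Python and pandas; the table and file names are illustrative, and the in-memory frame stands in for data read from a source system.

```python
# Extract raw records, transform them into a usable shape, load into SQLite.
import sqlite3
import pandas as pd

# Extract: this frame stands in for data pulled from a source system.
raw = pd.DataFrame({"order_id": [1, 2, 3], "amount_usd": ["10.5", "20.0", "bad"]})

# Transform: coerce types and drop rows that fail validation.
raw["amount_usd"] = pd.to_numeric(raw["amount_usd"], errors="coerce")
clean = raw.dropna(subset=["amount_usd"])

# Load: write the curated table to the destination store.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```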

Explainability/Interpretability

Explainability relates to the understanding of what influences the behavior of ML models.

See Model interpretability in Azure.

Feature Engineering

Feature engineering is the process of using domain knowledge to supplement, cull or create new features to aid the machine learning process with the goal of increasing the underlying model's predictive power.

See Feature Engineering.
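
For illustration, the sketch below derives a few new features from raw order columns with pandas; the column names and domain rules are made up.

```python
# Derive predictive features from raw columns using simple domain knowledge.
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-01"]),
    "quantity": [2, 5, 1],
    "unit_price": [9.99, 4.50, 120.00],
})

orders["order_value"] = orders["quantity"] * orders["unit_price"]   # interaction feature
orders["day_of_week"] = orders["order_date"].dt.dayofweek           # temporal feature
orders["is_bulk_order"] = (orders["quantity"] >= 5).astype(int)     # domain rule
print(orders)
```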

Feature Store

A feature store is a data system designed to manage raw data transformations into features - data inputs for a model. Feature Stores often include metadata management tools to register, share, and track features as well.

See Feature Store.

Feature Serving

Feature serving refers to the capability to serve feature values both for high-latency batch operations, such as training, and for low-latency inference. It hides away some of the complexity when querying the feature values while providing functionality like point-in-time joins.

See Feature Store.

Feature Transformation

Feature transformation refers to the process of converting raw data into features. This generally requires the building of data pipelines to ingest both historical and real-time data.

See Feature Store.

Fusion Team

A fusion team is a multidisciplinary team that blends technology or analytics and business domain expertise and shares accountability for business and technology outcomes. Instead of organizing work by functions or technologies, fusion teams are typically organized by the cross-cutting business capabilities, business outcomes or customer outcomes they support.

See Fusion Team (Gartner).

GEM

General Engineering Manager.

Ground Truth

Ground truth data represents the correctly labeled data according to a business domain user. In other words, the ground truth represents the data the model needs to predict.

See Ground Truth (Wikipedia).

Hyperparameter Tuning

Hyperparameters are adjustable parameters that let you control the model training process. For example, with neural networks, you decide the number of hidden layers and the number of nodes in each layer. Model performance depends heavily on hyperparameters. Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of hyperparameters that results in the best performance. The process is typically computationally expensive and manual.

See Hyperparameter Tuning.
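
A minimal sketch of automated search using scikit-learn's GridSearchCV over a small, made-up parameter grid; dedicated tuning services and libraries offer more sophisticated search strategies than exhaustive grid search.

```python
# Search a small hyperparameter grid with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```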

Hyperautomation

Hyperautomation is a business-driven, disciplined approach that organizations use to rapidly identify, vet, and automate as many business and IT processes as possible. Hyperautomation involves the orchestrated use of multiple technologies, tools, or platforms, including artificial intelligence (AI), machine learning, event-driven software architecture, robotic process automation (RPA), business process management (BPM) and intelligent business process management suites (iBPMS), integration platform as a service (iPaaS), and low-code/no-code tools.

See Hyperautomation (Gartner).

Intelligent Automation

Intelligent automation is the combination of artificial intelligence, machine learning and process automation used to create smarter processes.

See Intelligent Automation (Cognizant).

ISE

Industry Solutions Engineering, a team of friendly Microsoft engineers who work with customers to create production-ready solutions.

Label Auditing

Label auditing is the process of checking and validating the labels applied to a training dataset. This can be either a manual process or automated and then verified manually.

See Automated Label Auditing.

Jenkins

Jenkins is an open-source automation server commonly used to build CI/CD pipelines, including MLOps pipelines.

See Jenkins MLOps Template.

Labeling Marketplace

A marketplace of labeling providers that helps find the most suitable way to annotate or label the data, based on the use case and constraints.

LLM

Large Language Model.

See Large Language Model (Wikipedia).

Low Code

A method of software and application development that allows users to create enterprise-grade business apps and automations using drag-and-drop visual designers and simple Excel-like expressions.

See What is low-code development?.

Massively Parallel Processing (MPP)

Massively parallel processing (MPP) is the coordinated processing of a single task by multiple processors, each processor using its own OS and memory and communicating with each other using some form of messaging interface. MPP can be set up with a shared-nothing or shared-disk architecture.

See Transitioning from SMP to MPP, the why and the how.

ML Pipeline

A machine learning pipeline is a sequence of steps that are orchestrated to manage and automate the flow of data in and out of an ML process.

See AML Pipelines.
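
The entry above refers to Azure Machine Learning pipelines; purely to illustrate the idea of ordered, reusable steps, the sketch below uses scikit-learn's in-process Pipeline instead, which chains preprocessing and training rather than orchestrating managed cloud jobs.

```python
# Chain a preprocessing step and a training step into one pipeline object.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),        # data preparation step
    ("model", LogisticRegression()),    # training step
])
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))
```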

MLOps

Machine learning operations (MLOps) is based on DevOps principles and practices that increase the efficiency of workflows, such as continuous integration, delivery, and deployment. MLOps applies these principles to the machine learning process, with the goals of faster experimentation and development of models, faster deployment of models into production, and quality assurance with end-to-end lineage tracking.

See MLOps - ML Model Management - Microsoft

Model Factory

A machine learning (ML) model factory is a system for automatically building, training, and deploying ML models at scale. It includes a variety of features that make it easier to create and manage large numbers of ML models, as well as to automate the model-building process.

See MLOps Model Factory Template.

ModelOps

ModelOps (or AI model operationalization) is focused primarily on the governance and lifecycle management of a wide range of operationalized artificial intelligence (AI) and decision models, including machine learning, knowledge graphs, rules, optimization, linguistic and agent-based models. Core capabilities include continuous integration/continuous delivery (CI/CD) integration, model development environments, champion-challenger testing, model versioning, model store and rollback.

See Definition of ModelOps (Gartner).

No-Code

A method of software and application development that allows users to create enterprise-grade business apps, forms, and automations using drag-and-drop visual designers without the need to write any code.

See What is a no-code app builder?.

OARP

Model to designate roles: Owner, Approver, Responsible, Participant.

OCI

Open Container Initiative, a project formed to create open industry standards for container formats and runtimes.

See About the Open Container Initiative.

OCI Image

An immutable file made up of layers. Defined by the OCI Image Specification; alias(es): container image, image

See OCI Image Specification (Open Container Initiative).

OCI Image Layer

A blob: a serialized filesystem and/or filesystem changes, including additions, modifications, or removals. alias(es): layer

See Filesystem Layers (Open Container Initiative).

Online Analytical Processing (OLAP)

Online analytical processing (OLAP) is a technology that organizes large business databases and supports complex analysis. It can be used to perform complex analytical queries without negatively affecting transactional systems.

Observation Data

Observation data refers to the raw input for the data being queried in the Feature Serving layer; it is composed of at least the IDs of the entities of interest and timestamps, both of which act as join keys. This concept is called an entity data frame in other feature stores.

See Feature Store.

Online Transaction Processing (OLTP)

The management of transactional data using computer systems is referred to as online transaction processing (OLTP). OLTP systems record business interactions as they occur in the day-to-day operation of the organization, and support querying of this data to make inferences.

OPA

Open Policy Agent, a general-purpose policy engine that supports policy-based control in cloud native environments using a high-level declarative language.

See www.openpolicyagent.org.

ORAS

OCI Registry As Storage, a project to provide a way to push and pull OCI Artifacts to and from container registries.

See What is ORAS?.

OSS

Open Source Software, from a security perspective Microsoft defines OSS as "any source code, language package, module, component, library, or binary that you can consume into your software project as a dependency that does not have a paid-support contract." Nuance surrounding licensing comes into play when the term is used in a broader context.

OWASP

The Open Web Application Security Project (OWASP) is a non-profit foundation that works to improve the security of software. Through community-led open-source software projects, hundreds of local chapters worldwide, tens of thousands of members, and leading educational and training conferences, the OWASP Foundation is the source for developers and technologists to secure the web.

See OWASP.org.

Point-in-Time Joins (PITJ)

For time series data, it is important to make sure that the data used for training is not mixed with the latest ingested data, as doing so creates feature leakage (also known as label leakage). Point-in-time joins ensure that the data served corresponds to the closest values available at each observation time.

See Feature Store.
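
A minimal illustration of the idea with pandas.merge_asof (feature stores implement this at scale): each observation picks up the most recent feature value at or before its timestamp, so future values never leak into training data. The timestamps and values are made up.

```python
# For each observation, join the latest feature value at or before its timestamp.
import pandas as pd

features = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-06"]),
    "customer_id": ["A", "A", "A"],
    "avg_spend": [10.0, 12.5, 30.0],
})
observations = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-02", "2024-01-05"]),
    "customer_id": ["A", "A"],
})

joined = pd.merge_asof(observations, features, on="ts", by="customer_id",
                       direction="backward")
print(joined)   # 2024-01-02 -> 10.0, 2024-01-05 -> 12.5
```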

Professional Developer

Professional developer is a persona that represents people with a traditional software engineering background. Software engineers typically work as part of a development team to implement solutions by writing code. Professional developers apply software engineering fundamentals to create high-quality production ready software.

See Software Engineering (Wikipedia).

Prompt Engineering

Prompt engineering is a relatively new discipline for developing and optimizing prompts to efficiently use language models (LMs) for a wide variety of applications and research topics. Prompt engineering skills help to better understand the capabilities and limitations of large language models (LLMs).

See Prompt Engineering.

Real-Time Processing

Real-time processing deals with streams of data that are captured in real-time and processed with minimal latency to generate real-time (or near-real-time) reports or automated responses. For example, a real-time traffic monitoring solution might use sensor data to detect high traffic volumes.

Reinforcement Learning

Reinforcement learning is a category of Machine Learning algorithms that enables an agent to learn in an interactive environment through trial and error, by using the feedback generated from its own actions and experiences.

See Reinforcement Learning with Azure Machine Learning.

Responsible AI

Responsible Artificial Intelligence (Responsible AI) is an approach to developing, assessing, and deploying AI systems in a safe, trustworthy, and ethical way. AI systems are the product of many decisions made by those who develop and deploy them. From system purpose to how people interact with AI systems, Responsible AI can help proactively guide these decisions toward more beneficial and equitable outcomes. That means keeping people and their goals at the center of system design decisions and respecting enduring values like fairness, reliability, and transparency.

See Responsible AI with Azure.

Robotic Process Automation

Robotic process automation (RPA) is the process of automating business procedures through mimicking and automatically executing rule-based tasks. Through RPA, a machine copies and records the set of actions that a user takes to complete a task.

See What is RPA?.

SAST

Static Application Security Testing tools provide deep analytical insight into the syntax and semantics of code and provide just-in-time learning, preventing the introduction of security vulnerabilities before the application code is committed to your code repository.

See Microsoft Security Engineering - Tools and Automation

SBOM

Software Bill of Materials, a document outlining an inventory of software components.

See CISA on Software Bill of Materials.

SCA

Software Composition Analysis (SCA) tools assist with identifying licensing exposure, provide an accurate inventory of third-party components, and report any vulnerabilities in referenced components.

See Microsoft Security Engineering - Software Composition Analysis

Semi-Supervised Learning

Semi-supervised learning algorithms are designed to learn an unknown concept from a partially labeled data set of training examples. They are widely popular in practice, since labels are often very costly to obtain.

See Robust Semi-Supervised Learning (Microsoft Research).
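
A minimal sketch using scikit-learn's LabelSpreading, where unlabeled points are marked with -1 and the few known labels are propagated across the data; the dataset and the fraction of hidden labels are arbitrary.

```python
# Learn from a data set where most labels are hidden (-1 = unlabeled).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.choice(len(y), size=270, replace=False)
y_partial[unlabeled] = -1                  # hide 90% of the labels

model = LabelSpreading()
model.fit(X, y_partial)
print((model.transduction_ == y).mean())   # agreement with the hidden labels
```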

SIEM

Security Information and Event Management provides the ability to gather security data from information system components and present that data as actionable information via a single interface.

See NIST CSRC.

Slowly Changing Dimension (SCD)

A slowly changing dimension (SCD) in data management and data warehousing is a dimension which contains relatively static data which can change slowly but unpredictably, rather than according to a regular schedule. The most common types of SCDs are Type 1, Type 2, and Type 3.

SME

Subject Matter Experts.

SOAR

Security Orchestration, Automation and Response refers to technologies that enable organizations to collect inputs monitored by the security operations team. For example, alerts from the SIEM system and other security technologies — where incident analysis and triage can be performed by leveraging a combination of human and machine power — help define, prioritize and drive standardized incident response activities.

See SOAR (Gartner).

Software Supply Chain

The term software supply chain is used to refer to everything that goes into your software and where it comes from: your dependencies, their properties, and the dependencies that they in turn depend on.

See Security Best Practices.

Software-as-a-Service

Software-as-a-Service (SaaS) refers to cloud-based apps that users connect to and use over the Internet. SaaS apps often come with significant out-of-the-box capabilities and can be customized through no-code, low-code, and high-code extensibility interfaces. Common examples are email and calendaring (such as Microsoft Office 365), communications (such as Microsoft Teams and Slack), CRM (such as Microsoft Dynamics and Salesforce), and ERP (such as Microsoft Dynamics and SAP).

See What is SaaS?.

Streaming Data

Streaming data is data that flows into a system and is continuously collected and stored.

See Streaming Data Ingestion.

Structured Data vs Unstructured Data

Structured data is not limited to data that conforms to a schema or data model; it can also include forms that have a set structure. Unstructured data is information that is not arranged according to a pre-set data model or schema, and includes data such as videos, free-form documents, and images.

See Structured vs Unstructured data - IBM.

Supervised Learning

In supervised learning, each data point is labeled or associated with a category or value of interest. An example of a categorical label is assigning an image as either a ‘cat’ or a ‘dog’. An example of a value label is the sale price associated with a used car. The goal of supervised learning is to study many labeled examples like these, and then to be able to make predictions about future data points. For example, identifying new photos with the correct animal or assigning accurate sale prices to other used cars. This is a popular and useful type of machine learning.

See Supervised Learning.
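
A minimal sketch of the train-then-predict workflow with scikit-learn on a small labeled dataset; the model choice is arbitrary.

```python
# Train on labeled examples, then evaluate predictions on held-out labeled data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                      # each row has a label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))   # quality of the predictions
```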

Supply Chain Attack

Supply chain attacks are an emerging kind of threat that target software developers and suppliers. The goal is to access source code, build processes, or update mechanisms by infecting legitimate apps to distribute malware.

See Supply Chain Malware.

Synthetic Data Generation

Synthetic data is artificially generated data that is representative of real data. This can be very beneficial for a number of reasons; for example, PII and other sensitive data can be removed or obfuscated so that the data can be shared more broadly. Generating representative synthetic data requires investment and development effort, but it offers a lot of flexibility.

See Synthetic Data Generation in MLOPs.
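
As a rough illustration of the idea, the sketch below fits the mean and covariance of a made-up numeric dataset and samples new rows from that distribution, so the overall shape of the data is preserved without sharing any original record; production-grade synthetic data generation is considerably more involved.

```python
# Sample synthetic rows that follow the same mean/covariance as the "real" data.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=[35.0, 52_000.0], scale=[8.0, 9_000.0], size=(1_000, 2))  # e.g. age, income

mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

print(synthetic[:3])   # share rows like these instead of the real ones
```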

Threat Modeling

Threat modeling is an engineering technique used to identify threats, attacks, vulnerabilities, and countermeasures that could affect an application. It helps shape an application's design, meet a company's security objectives, and reduce risk.

See Microsoft Security Engineering - Threat Modeling

Unsupervised Learning

In unsupervised learning, data points have no labels associated with them. Instead, the goal of an unsupervised learning algorithm is to organize the data in some way or to describe its structure. Unsupervised learning groups data into clusters, as K-means does, or finds different ways of looking at complex data so that it appears simpler.
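
A minimal sketch with scikit-learn's KMeans: the labels are discarded and the algorithm organizes the points into clusters purely from their structure; the data and cluster count are made up.

```python
# Group unlabeled points into clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels discarded

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
```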

VEX

Vulnerability Exploitability eXchange, a document which indicates whether a product or products are impacted by known vulnerabilities.

See NTIA overview on VEX.

X++

X++ is the domain-specific programming language used to make code modifications to Dynamics 365 for Finance and Supply Chain Management (D365 F&SCM). It looks and behaves similarly to C#, but has a separate compiler and runtime.

See X Language Programming Guide.

Z-Ordering

Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake in data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Apache Spark needs to read.

See Z-Ordering (Delta Lake).