
Understanding dedicated SQL pool best practices

Use this article to learn about the design considerations to keep in mind when evaluating dedicated SQL pools (formerly SQL DW).

Choosing Synapse dedicated SQL pools for compute

Azure Synapse offers multiple compute options (as explained in Get Started with Azure Synapse Analytics); one of them is dedicated SQL pools.

Dedicated SQL pools use a Massively Parallel Processing (MPP) architecture, which is different from a traditional Symmetric Multiprocessing (SMP) database. As a result, some of the concepts that work for traditional SMP databases like Azure SQL don't apply or can't be translated to an MPP architecture.

In an SMP database, processors share resources like storage and memory. In contrast, MPP is often called a “shared nothing” architecture, since processors don't share their resources. SMP systems provide increased throughput and reliability: if one processor fails, another can fill its place quickly. As a result, SMP systems can dynamically balance a workload to serve more users faster, which makes them ideal for situations with high concurrency of simple queries. Conversely, MPP systems are used for large data volumes and analytical use cases, as they allow scaling out computational processing across multiple nodes.

Dedicated SQL pools in Synapse allow decoupling of compute from storage (storage is sometimes also referred to as data distributions). The number of compute nodes depends on the Service Level Objective (SLO), or the number of Data Warehouse Units (DWU), while the data distributions are independent of the SLO and are fixed at 60. As a result, the number of distributions per node varies with the SLO; for example, a DW1000c pool runs on two compute nodes, so each node manages 30 of the 60 distributions. To understand how the ratio changes with different service levels, see Synapse Service Levels.

Dedicated SQL pools don't have built-in autoscale or auto-pause capabilities, but these operations can be scheduled or automated to some extent. For examples, see how to pause and resume dedicated SQL pools with Synapse Pipelines and Quickstart: Scale compute in dedicated SQL pool.
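Because pause, resume, and scale aren't automatic, they're typically driven from a pipeline, Azure PowerShell, or T-SQL. The sketch below shows the scale operation in T-SQL; the pool name and target SLO are illustrative, and the statement runs against the master database of the logical server.

```sql
-- Minimal sketch: scale a dedicated SQL pool to a different SLO.
-- "mydwpool" and 'DW1000c' are illustrative; run this against the master database.
ALTER DATABASE mydwpool
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

-- Pause and resume aren't available in T-SQL; they're usually automated with
-- Azure PowerShell, the REST API, or a Synapse pipeline, as in the linked examples.
```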

Optimize data ingestion in dedicated SQL pools

When data is ingested into Dedicated SQL Pools, it is sharded into 60 distributions to optimize system performance. The sharding pattern to distribute data can be selected when you define the table. As a rule of thumb, the best distribution strategy should minimize the amount of data movement to fulfill query demands.

When loading large amounts of data, it's good practice to group INSERT statements into batches: develop one process that writes rows to a file, and another process that periodically loads that file. For more information, see Group INSERT statements into batches.
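As a sketch of the second half of that pattern, the staged file can be loaded in a single batch with the COPY statement; the table name, storage path, and file format options below are illustrative.

```sql
-- Load a staged file in one batch rather than issuing many singleton INSERTs.
-- Table, storage account, and path are hypothetical placeholders.
COPY INTO dbo.FactSales
FROM 'https://mystorageaccount.blob.core.windows.net/staging/sales/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = ',',
    FIRSTROW = 2   -- skip the header row
);
```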

PolyBase is generally the best choice when you are loading or exporting large volumes of data, or you need faster performance. You can use PolyBase in CTAS statements or directly from Azure Data Factory pipelines.
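For illustration, the CTAS sketch below loads from an external table into a distributed internal table. It assumes an external table ext.Sales has already been defined over files in Azure Storage; all object names are illustrative.

```sql
-- PolyBase-style load with CTAS: read from an external table and materialize
-- the data as a hash-distributed table with a clustered columnstore index.
CREATE TABLE dbo.Sales
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM ext.Sales;
```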

If you are loading data to a hash-distributed table, your ingested data shouldn't be ordered by the hash-distribution key (see the Designing tables for dedicated SQL pools section).

Data loading can affect reads and, conversely, reads can affect data loading: loading data into a table requires an exclusive lock, so a load has to wait until all locks on the table, including read locks, are released. Likewise, while a load is running, all other operations on that table, including reads, have to wait. For this reason, partition switching is the recommended approach: switching a partition is just a metadata operation, so it executes quickly and keeps the table lock short, reducing the impact on reads. In practice, load the data into a staging table whose table definition and partition boundaries are identical to the production table; after the load completes, switch the staged partition into the production table. For an example of partition switching, see the GitHub sample.
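A minimal sketch of the switch itself is shown below; it assumes dbo.FactSales_Stage and dbo.FactSales share the same columns, distribution, indexes, and partition boundaries, and the partition number is illustrative.

```sql
-- Switch a loaded partition from the staging table into the production table.
-- This is a metadata-only operation, so the lock on dbo.FactSales is brief.
ALTER TABLE dbo.FactSales_Stage
SWITCH PARTITION 2 TO dbo.FactSales PARTITION 2;
```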

Designing tables for dedicated SQL pools

In Dedicated SQL Pools, regular tables (that is, non-temporary tables) store data in Azure Storage and are distributed by default. Dedicated SQL Pools support three ways of distributing tables: round-robin, hash, and replicated.

Distributed tables can be either round-robin or hash distributed. With hash distribution, rows with equal distribution key values are guaranteed to land in the same distribution. For round-robin tables, the distribution of rows is random. For more details on how to choose the best distribution, see Distributed tables design guidance.

Replicated tables aren't distributed across multiple nodes but are instead replicated across all nodes. With a replicated table all nodes have a copy of the full table available locally. As a result, replicated tables are recommended when join conditions would otherwise require data movement. Replicated tables are also a good choice when the table size on disk is less than 2 GB. For more information, see Design guidance for replicated tables.
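The sketches below show the three distribution options as table DDL; the schemas, table names, and column choices are illustrative.

```sql
-- Hash-distributed: rows with the same CustomerKey land in the same distribution,
-- which avoids data movement for joins and aggregations on that key.
CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    Amount      DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

-- Round-robin: rows are spread evenly with no distribution key; a common choice
-- for staging tables.
CREATE TABLE dbo.StageSales
(
    SaleId      BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    Amount      DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Replicated: every compute node keeps a full copy; suited to small dimension tables.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CustomerKey));
```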

Improving query performance

You can improve query performance by choosing the right distribution method (as explained also above) and sometimes by partitioning tables.

NOTE: In Synapse Dedicated SQL Pools, tables are always sharded into 60 distributions regardless of the table design, so partitioning a table splits the data even further and should be done with caution. In particular, take into account that if the table has a Clustered Columnstore Index (CCI), partitioning the table partitions the CCI too. For optimal performance, a CCI should have at least 1 million rows per partition and distribution. For more information, see Partition Sizing.
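To make the sizing concrete, consider the sketch below; names and boundary values are illustrative. The three RANGE RIGHT boundaries create four partitions, so with 60 distributions the table would ideally hold at least 60 × 4 × 1,000,000 = 240 million rows.

```sql
-- A partitioned, hash-distributed CCI table (illustrative names and boundaries).
-- Three RANGE RIGHT boundary values create four partitions; with 60 distributions,
-- the table would ideally hold at least 60 * 4 * 1,000,000 = 240 million rows.
CREATE TABLE dbo.FactSalesPartitioned
(
    SaleId      BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    SaleDate    DATE           NOT NULL,
    Amount      DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01'))
);
```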

When analyzing query performance, you might find that a table needs to be redesigned (its distribution, for example). For query performance troubleshooting and tuning techniques, see How to minimize data movements (Compatible and Incompatible Joins).
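One way to see whether a query involves data movement is to inspect its distributed steps in the DMVs; the request ID below is illustrative.

```sql
-- Inspect the distributed steps of a request to spot data movement.
-- 'QID12345' is a placeholder; find the actual ID in sys.dm_pdw_exec_requests.
SELECT step_index, operation_type, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID12345'
ORDER BY step_index;
-- Steps such as ShuffleMoveOperation or BroadcastMoveOperation indicate that rows
-- are being moved between distributions, typically because of an incompatible join.
```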

Columnstore indexes and table statistics are essential to query performance. For detailed information, see Best Practices for Maintaining Statistics and Statistics in Synapse SQL.
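As a minimal sketch, statistics can be created on join and filter columns and refreshed after large loads; the table and column names are illustrative.

```sql
-- Create single-column statistics on a column used in joins or filters.
CREATE STATISTICS stats_CustomerKey ON dbo.FactSales (CustomerKey);

-- Refresh statistics after significant data changes, such as a large load.
UPDATE STATISTICS dbo.FactSales;
```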

For query performance, it's also important to remember that memory and resource limits are determined by the chosen SLO (number of DWUs); when you're short of other options, you might need to increase the SLO to achieve better performance. The SLO also determines the maximum number of concurrent queries: as a reference, at most 128 concurrent queries can run at any time, even with the largest SLO (DW30000c). For more information, see Memory and Concurrency Limits.

Dedicated SQL Pools use resource classes to determine how many resources a user's or process's queries receive. Within a given SLO, the concurrency limits and the assigned resource class together determine the maximum performance a query can achieve. For detailed information, see Concurrency maximums for resource classes.
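For example, a user that runs large loads can be placed in a bigger static resource class so its queries get a larger memory grant; the login name below is illustrative.

```sql
-- Assign a user to the 'largerc' static resource class (login name is illustrative).
-- Larger resource classes get more memory per query but reduce concurrency.
EXEC sp_addrolemember 'largerc', 'LoadUser';
```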

Manage monitoring and observability

Monitoring and observability are essential for ensuring that your workloads and queries can run and are in a healthy state. The Azure Synapse Toolbox is a collection of useful tools and scripts to help you manage and monitor the health of your Azure Synapse Analytics workspaces.

Enabling alerts allows you to detect workload issues earlier, providing an opportunity to take action sooner and minimize the impact on end users. For more information, see the community article.

There are many options available for monitoring your Dedicated SQL Pool. For a walkthrough of implementing a monitoring solution for Azure Synapse Analytics, see the community article.
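A common starting point is the built-in DMVs; the sketch below lists requests that are still active in the pool.

```sql
-- List requests that are currently active (not yet completed, failed, or cancelled).
SELECT TOP 20
    request_id, [status], submit_time, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
WHERE [status] NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY submit_time DESC;
```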

How to manage security

When it comes to security, Dedicated SQL Pools can be secured like any other Azure SQL database. For an extensive overview, see Secure a dedicated SQL pool.

For more information