Optimizing Databricks

To optimize Databricks for speed and cost efficiency, developers need to understand how query planning and execution work in Spark. When a query is submitted, Spark first creates a Logical plan. This plan is optimized but not executed until an Action (collect, count, display, and so on) is triggered. Once an action method is called, Spark prepares a Physical execution plan to implement the Logical plan on the cluster.

When executing the Physical plan, Spark follows an execution hierarchy. At the top of this hierarchy are jobs. Invoking an action within Spark triggers the launch of a Spark job.

A Spark job comprises several stages. Spark creates a new stage whenever it encounters an operation that requires a shuffle. Wide transformations, such as join or merge, shuffle data between executors and thus trigger a new stage.

A Spark stage comprises several tasks. Every task in the stage executes the same set of instructions, each on a different subset of the data. A task is the smallest execution unit in Spark. Tasks are executed by an executor. In Databricks, each core/slot/thread of the executor executes one task.
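
The following minimal PySpark sketch illustrates this hierarchy (the numbers and the bucket expression are illustrative only): the groupBy below forces a shuffle, so the collect action launches a job with two stages, each made up of one task per partition.

    from pyspark.sql import SparkSession

    # On Databricks the `spark` session already exists; this line only matters
    # when running the sketch outside a notebook.
    spark = SparkSession.builder.getOrCreate()

    # A narrow transformation (filter) stays within the current stage.
    df = spark.range(1_000_000).filter("id % 2 = 0")

    # groupBy requires a shuffle, so Spark inserts a stage boundary here.
    counts = df.groupBy((df.id % 10).alias("bucket")).count()

    # Nothing has executed yet. collect() is the action that launches the job:
    # one stage before the shuffle, one after, with one task per partition.
    result = counts.collect()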

Optimizing a cluster

Optimizing a cluster in Databricks means choosing a cluster configuration that delivers the best balance of speed and cost for the workload.

In Databricks, running a Spark job requires at least one driver and one worker. The Spark driver orchestrates the processing and distributes it among the Spark executors. The driver has access to the metadata of the tasks shared with the executors, and once the executors complete their tasks, the results are shared back with the driver. In most cases, the driver doesn't need to be a large node (for example, many cores and lots of memory). However, the collect action brings all the data from the executors back to the driver; on a large dataset, this operation can cause an out-of-memory error.
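
As a minimal sketch (the table names events and events_copy are hypothetical), the snippet below shows two ways to avoid pulling an entire dataset onto the driver:

    # `events` is a hypothetical table name used only for illustration.
    large_df = spark.read.table("events")

    # Risky on a large dataset: collect() pulls every row back to the driver
    # and can trigger an out-of-memory error there.
    # all_rows = large_df.collect()

    # Safer alternatives keep the heavy lifting on the executors:
    preview = large_df.limit(1000).collect()                      # bring back a bounded sample
    large_df.write.mode("overwrite").saveAsTable("events_copy")   # or write results out instead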

Spark achieves parallelism by assigning tasks to free cores: once every task in a stage has a core, any remaining free cores can run tasks from other jobs simultaneously. Choosing the right executor configuration therefore depends mainly on how many tasks the query produces in the Physical execution plan. The number of cores required is driven by the maximum number of tasks that must run concurrently within a stage, and the same logic applies when choosing the amount of memory per core.
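
A quick way to sanity-check this from a notebook is to compare the number of tasks a stage would produce with the number of available slots; `df` below stands for any DataFrame, such as the one in the earlier sketch:

    # Total task slots (executor cores) available to the application; on most
    # cluster managers defaultParallelism approximates the total core count.
    slots = spark.sparkContext.defaultParallelism

    # Number of tasks the next stage over `df` would produce.
    tasks = df.rdd.getNumPartitions()

    # If a stage produces far fewer tasks than there are slots, cores sit idle;
    # repartitioning to (a multiple of) the slot count keeps every core busy.
    if tasks < slots:
        df = df.repartition(slots)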

The Ganglia UI in Databricks is a useful tool to monitor cluster utilization and identify opportunities to improve performance.

Databricks Ganglia UI

Notably, a larger cluster isn't necessarily more expensive than a smaller one. It can be faster, and because cost is primarily determined by how long the workload runs rather than by cluster size alone, a larger cluster that finishes sooner can end up costing the same or less.

Using partitioning

Applying the right partitioning scheme to a dataset improves performance when reading from and writing to storage. In most cases, the performance bottleneck is reading the required data from storage. Check Azure Databricks data storage for details.
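
As a rough sketch (the events table, the event_date column, and the output path are hypothetical), partitioning a Delta table on a low-cardinality column that queries commonly filter on lets reads skip entire directories:

    # Write the table partitioned by the filter column.
    (spark.read.table("events")
        .write
        .format("delta")
        .partitionBy("event_date")
        .mode("overwrite")
        .save("/mnt/datalake/events_partitioned"))

    # A query that filters on the partition column reads only the matching partitions.
    matching = (spark.read.format("delta")
        .load("/mnt/datalake/events_partitioned")
        .filter("event_date = '2024-01-01'")
        .count())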

Optimizing queries

Writing an efficient query is one of the most effective mechanisms for gaining short-term performance improvements. Here are some considerations:

  • Configure a suitable number of shuffle partitions.

    • When doing data transformations (for example, groupBy, join), Spark shuffles data between executors. By default, Spark uses 200 shuffle partitions. However, for a large dataset this number might be too small. The number can be changed with the following Spark configuration: spark.conf.set("spark.sql.shuffle.partitions", 768) (see the sketch after this list). Calculating the right number of shuffle partitions requires some testing and knowledge of the complexity of the transformations and the table sizes.
  • Filter data as early as possible.
  • Use Disk Cache.
  • Avoid the use of user-defined functions (UDFs).
  • Consider using Photon as much as possible.

    • Spark tasks that run in Photon are visualized in the query execution plan; the nodes marked in gold are executed in Photon. Photon only covers certain operators, expressions, and data types, so it's possible that not all of the execution plan runs in Photon.
  • All transformations in Spark are lazy: they don't compute their results right away, but only when an action requires them.
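
The sketch below ties the first two points together (the orders and customers tables, the join key, and the value 768 are illustrative assumptions, not recommendations): the shuffle partition count is raised before a join, and filtering happens before the shuffle so less data moves between executors.

    # Raise the shuffle partition count; 768 matches the example above and
    # would normally be tuned to the data volume and total core count.
    spark.conf.set("spark.sql.shuffle.partitions", 768)

    # Filter as early as possible so less data is shuffled by the join below.
    orders = spark.read.table("orders").filter("order_date >= '2024-01-01'")
    customers = spark.read.table("customers")

    # Transformations are lazy: nothing runs until the action below.
    joined = orders.join(customers, "customer_id")

    # The action triggers the job; the join's shuffle uses 768 partitions.
    # display() is the Databricks notebook display function.
    display(joined.groupBy("country").count())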

For more information

Optimization recommendations on Azure Databricks