We have partnered with Databricks to migrate to the Data Lakehouse architecture and reap the benefits of a unified data platform.
All our reporting queries (both internal and external) were offloaded to the Databricks SQL warehouse. The warehouse bundles a resource manager with the vectorized Photon query engine on top of Spark, acts as a SQL layer over the data lake, and supports easy API integration with various BI applications at scale.
After moving to the Data Lakehouse, this SQL warehouse fulfilled our SQL engine requirement.
At InMobi, we have two types of reporting customers for the Supply Side Platform (SSP) data:
- External users, such as publishers and advertisers
- Internal users, such as business, product, and other internal teams
These two groups have different latency requirements, and during the migration we were bound to keep those requirements intact for both. Once we moved to the Delta Lakehouse, however, meeting them became harder: P99 latency increased by ~500%, P95 by ~400%, and P50 by ~200%.
With latencies trending upward, we started looking for ways to bring them back within acceptable limits and worked very closely with the Databricks team. The problem was tackled in the phases below.
The initial setup used a single warehouse to answer all SQL queries. The load split between light (short-lived) and heavy (long-lived) queries was roughly ~95% and ~5%, respectively. The interesting pattern was the scarcity of resources when both types of queries ran on the same warehouse: while heavy queries were running, the warehouse could not set aside the resources needed to answer time-sensitive queries, which marks the limits of a single SQL warehouse under a highly unpredictable call-volume pattern.
| | P50 | P95 | P99 |
|---|---|---|---|
| Queries | ~4 mins | ~30 mins | ~200 mins |
Based on latency requirements, the warehouse was split into two: light queries with lower latency requirements were served by one warehouse, and the other was configured to process the heavy queries. This was a logical split backed by a single Delta Lake, with segregation based on strict read patterns, and both warehouses were scaled up from M to 2XL. (A sketch of how queries can be routed to the two warehouses follows the table below.)
| | P50 | P95 | P99 |
|---|---|---|---|
| Light Queries | ~1 sec | ~8 secs | ~10 secs |
| Heavy Queries | ~90 secs | ~20 mins | ~70 mins |
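The routing itself can live in the reporting service. The sketch below is purely illustrative: it assumes two SQL warehouse HTTP paths, a hypothetical `is_heavy()` classifier, and the `databricks-sql-connector` package; the hostnames, tokens, and classification rule are placeholders, not our production setup.

```python
# Illustrative sketch: route light vs heavy queries to separate SQL warehouses.
# HOST, TOKEN, the HTTP paths, and is_heavy() are placeholders, not InMobi's actual setup.
import os
from databricks import sql  # pip install databricks-sql-connector

HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]
LIGHT_HTTP_PATH = os.environ["LIGHT_WAREHOUSE_HTTP_PATH"]  # latency-sensitive warehouse
HEAVY_HTTP_PATH = os.environ["HEAVY_WAREHOUSE_HTTP_PATH"]  # warehouse for long-lived queries

def is_heavy(query: str) -> bool:
    # Placeholder classifier; in practice the split would be driven by known
    # read patterns (e.g., report type or scanned date range).
    return "large_fact_table" in query

def run_query(query: str):
    # Pick the warehouse based on the expected query weight, then execute.
    http_path = HEAVY_HTTP_PATH if is_heavy(query) else LIGHT_HTTP_PATH
    with sql.connect(server_hostname=HOST, http_path=http_path, access_token=TOKEN) as conn:
        with conn.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()
```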
During the course of this work, Azure storage account throttling was identified as a hard limitation. In collaboration with Databricks support, we were able to overcome it by applying additional Spark parameters to the Databricks SQL warehouse. Please note that the option to view and update this Spark configuration is currently not available in the Azure UI; however, it can be done quickly via the REST API that Databricks exposes for the SQL warehouse.
| Parameter | Value |
|---|---|
| spark.hadoop.fs.azure.enable.autothrottling | false |
| spark.hadoop.fs.azure.io.retry.backoff.interval | 250 |
| spark.hadoop.fs.azure.io.retry.min.backoff.interval | 250 |
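The snippet below is a minimal sketch of applying these parameters over the REST API. It assumes the workspace-level SQL warehouse configuration endpoint (`/api/2.0/sql/config/warehouses`) and its `data_access_config` field as documented in the Databricks REST API reference; verify the endpoint and payload against the current documentation before using it.

```python
# Hedged sketch: apply the Spark parameters above via the Databricks REST API.
# The endpoint and payload shape (/api/2.0/sql/config/warehouses, data_access_config)
# are assumptions to be checked against the current Databricks REST API reference.
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]   # e.g. https://adb-<workspace-id>.azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
CONFIG_URL = f"{HOST}/api/2.0/sql/config/warehouses"

# Read the current workspace-level warehouse config so other settings are preserved.
config = requests.get(CONFIG_URL, headers=HEADERS).json()

extra_params = {
    "spark.hadoop.fs.azure.enable.autothrottling": "false",
    "spark.hadoop.fs.azure.io.retry.backoff.interval": "250",
    "spark.hadoop.fs.azure.io.retry.min.backoff.interval": "250",
}
existing = {p["key"]: p["value"] for p in config.get("data_access_config", [])}
existing.update(extra_params)
config["data_access_config"] = [{"key": k, "value": v} for k, v in existing.items()]

# PUT replaces the whole config object, hence the read-modify-write above.
resp = requests.put(CONFIG_URL, headers=HEADERS, json=config)
resp.raise_for_status()
```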
Based on an analysis of the broadcast join threshold, the parameters below were updated to reduce the overall execution time.

| Parameter | Value |
|---|---|
| spark.sql.autoBroadcastJoinThreshold (default 10 MB) | 209715200 |
| spark.databricks.adaptive.cancelShuffleStageInBroadcastJoin.enabled | true |

Please note that the above configuration is supported with the Databricks 11.3 LTS runtime for job or interactive clusters.
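On clusters where you control the Spark session directly (the job or interactive clusters mentioned above), the same settings can also be applied per session. The snippet below is a minimal illustration; the 200 MB value simply mirrors the table above.

```python
# Minimal illustration: apply the broadcast-join settings on a Spark session,
# mirroring the values in the table above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raise the broadcast join threshold from the 10 MB default to 200 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 209715200)

# Let adaptive execution cancel shuffle stages that become unnecessary
# once a join is converted to a broadcast join.
spark.conf.set("spark.databricks.adaptive.cancelShuffleStageInBroadcastJoin.enabled", "true")
```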
These changes improved overall performance well beyond the initial warehouse setup. Along with the fine-tuned Spark configuration, the SQL warehouses were scaled down to L-size clusters while retaining the improved performance.
| | P50 | P95 | P99 |
|---|---|---|---|
| Light Queries | ~200 msec | ~3 secs | ~5 secs |
| Heavy Queries | ~26 secs | ~11 mins | ~40 mins |
As a best practice, separate known long-lived and short-lived query loads into different warehouses if the latency requirements are strict.
As recommended, consider moving the Databricks workload into a separate subscription.
The Databricks SQL warehouse provides the Photon engine and caching by default, along with an interactive UI to analyze queries and tweak Spark parameters. This comes in handy for tuning the warehouse's Spark configuration as needed.