September 13, 2024

Caching & Reuse of Subresults across Queries

TL;DR

Ever wonder how you can speed up complex queries without additional resources? In this post, we will show you how Firebolt optimizes query performance through caching and reusing results of parts of the query plan (= subresults) across consecutive queries.

Introduction

At Firebolt we are building a data warehouse enabling highly concurrent & very low latency analytics. Another way to frame this is that we strive to be faster than the blink of an eye:

Median query latency in milliseconds across all Firebolt customers in production, measured hourly over a period of a few days.

Firebolt’s main use cases are “data-intensive applications”, such as interactive analytics across various industries. These workloads typically consist of high-volume, sub-second queries stemming from a mix of tens or even hundreds of query patterns.

Such repetitive workloads can benefit tremendously from reuse / caching. In analytics systems, caching as a concept is ubiquitous: from buffer pools and full-result caching to materialized views. Here, we will present our findings on a surprisingly little-used approach: caching subresults of operators. The idea itself is not new, with the first publications appearing in the ’80s, e.g., [Fin82], [Sel88]. More recent results include [IKNG10], [Nag10], [HBBGN12], [DBCK17], and [RHPV et al. 24].

In this blog post, you will discover how we implemented a cache for subresults of arbitrary operators in the query plan, including hash tables of hash-joins, and how this can significantly enhance the system's performance. By sharing the impact we have seen in our production environment, you will gain insight into how caching can help reduce query times and improve efficiency for your own workloads. Additionally, we will explore a variation of the eviction strategy which we benchmarked and tuned on real-world data, showing how it outperforms traditional LRU, ultimately saving you time and resources.

Since our cache for now is purely in main memory, we ideally store the subresults in compact data structures which have a small RAM footprint. We therefore devised a custom, space-optimized hash table for our novel “FireHashJoin”. With it, we achieve more than 5x memory savings (as measured on production data) compared to the hash table implementation we used before. It thus enables us to cache significantly more of such subresults in the same amount of RAM. We will present this in a follow-up post.

What Do We Mean by “Subresult”?

Let us have a look at the following made-up query based on the TPC-H benchmark schema (see, e.g., page 13 here).
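
A query of roughly this shape matches the plan discussed below – the exact wording, aliases, and formatting are an illustrative sketch over the TPC-H schema:

    SELECT n_name,
           SUM(o_totalprice) AS total_price,
           COUNT(*)          AS num_orders
    FROM orders
    JOIN customer ON o_custkey = c_custkey
    JOIN nation   ON c_nationkey = n_nationkey
    GROUP BY n_name;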

Below we show a simplified query plan for this query. This is a graph which has “operators” (such as “Aggregate”) as nodes. In our figure the data flows from the bottom to the top and each operator takes care of one processing step, such as aggregating the incoming o_totalprice & row-counts and grouping them by n_name. For a more detailed explanation, see, e.g., this Wiki page.

So here is the query plan for the example query above:

For each operator, we call the entire data streamed out of this operator a subresult – it is what the operator and the subplan below it have computed. Example: all combinations of c_custkey, n_name streamed out of the lower-right Join together form the subresult of that Join operator.

We also refer to the subresult streamed out of the Aggregate operator at the top as the full result.

Before we get to the final artifact we consider a subresult, let us add a quick side note about our Join operator. We implemented it as a so-called hash join – perhaps the most common join implementation used in databases. Briefly summarized, the algorithm proceeds in two phases (for more details, see, e.g., this Wiki page):

  1. In the build phase it computes a hash table for the “build side” of the join. In our case this is the right side of each of the joins depicted above. For the lower join, the computed hash table maps n_nationkey to n_name as read from the nation table.
  2. In the probe phase the algorithm iterates over all rows of the “probe side” of the join (in our figure, the left sides of the two joins are the probe sides). For each row it probes the hash table, i.e., looks up the entry corresponding to the key coming from the left. In our example, for the lower join the algorithm would look up n_name given the c_nationkey coming from the left side. The operator then outputs the combination c_custkey, n_name for each of the probe rows.

We also consider the hash tables computed by Join operators as “special” subresults of the query plan. In our example, the lower join builds a hash table n_nationkey => n_name and the upper join a hash table c_custkey => n_name.

So to summarize, each operator produces a subresult (the data it streams out) and Join operators also create hash tables as “special” subresults.

Why and How to Cache & Reuse Subresults

Let us now have a look at why one would cache and later reuse subresults across consecutive queries. For similar approaches and in-depth motivation & impact, see, e.g., [IKNG10], [Nag10], [HBBGN12], and the more recent papers [DBCK17] and [RHPV et al. 24]. Below we give more details about the actual impact of our implementation in production.

For full results the motivation for caching is pretty obvious: why recompute if the exact same query comes in a second (or third or …) time? If the full result is small enough, we can keep it in a cache and reuse it as long as the data scanned has not changed. Workloads with many consecutive, exactly identical queries are actually more common than one might think. The recent VLDB paper analyzing Amazon Redshift’s workload [RHPV et al. 24] impressively shows this somewhat surprising fact. Below we also share some stats on our production workload.

Many query engines have a full result cache. Often it is simply based on a hashmap of query strings to full results. We do this in a somewhat different manner, which has advantages described later in this section. But we actually go much further: our system supports caching (and reusing) the results of any subplan. These subresults are placed in our in-memory FireCache, which can use up to 20% of the available RAM. Here is how we determine which subresults to cache:

  • On the one hand, our optimizer may add so-called MaybeCache operators above any node in the plan. As the name suggests, this operator may cache a subresult – if it is not too large. It may later fetch from the cache and reuse a subresult if the exact same subplan (with the same data being scanned) is evaluated again.
    Currently, the optimizer places a MaybeCache operator 1) at the top of the plan, for a full result cache and 2) at nodes where “sideways information passing” is happening to speed up joins (where the probe-side has an index on the key that is being joined on). For 2) let us only point out that this can be very beneficial for the case described in the next bullet. We will not go into more details here for simplicity.
    The MaybeCache operator is versatile: it can be placed anywhere in the plan. In the future, we may investigate adding it above other, selected operators, e.g., pipeline breakers such as Aggregate.
  • On the other hand, our system places any subresult hash-tables built for Join operators in the FireCache if they are not too large. The reasoning behind this is that these hash tables are relatively expensive to compute and it is very advantageous to reuse them in case similar queries come in consecutively.

So this is what the simplified plan for our example query would look like and what would be placed in the FireCache on the first evaluation:

A MaybeCache has been placed at the top of the plan and it places the subresult collected on the first run into the FireCache, as do both Join operators with their respective hash tables.

On a subsequent run of exactly the same query (over unchanged data), the MaybeCache fetches the subresult from the cache and the entire evaluation can be skipped. This means the latency drops to very low milliseconds. In our toy example this corresponds to a speed boost of more than 100x (on a single node, medium engine running over TPC-H with scale factor 100).

But what if the user adds a filter that changes across subsequent queries? This is where caching actual subresults kicks in. Say the user wants to restrict the date range, e.g., by adding ... AND o_orderdate >= '1998-01-01' ... (for changing dates) to the query:
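
Sketched on the illustrative query from above, the restricted variant would read something like:

    SELECT n_name,
           SUM(o_totalprice) AS total_price,
           COUNT(*)          AS num_orders
    FROM orders
    JOIN customer ON o_custkey = c_custkey
    JOIN nation   ON c_nationkey = n_nationkey
    WHERE o_orderdate >= '1998-01-01'
    GROUP BY n_name;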

The following figure shows what the query plan and the corresponding cache interactions would look like if the query without the restriction had been run previously:

Notice that the added filter has been pushed down by the optimizer to sit right above the scan of orders. The plan as a whole has changed, therefore the subresult stored by the MaybeCache previously cannot be reused. On the other hand, the right subplan below the upper Join (surrounded by a dashed box) has not changed! Therefore the previously computed and cached hash table can be reused in that Join operator. This saves us from evaluating the subplan and building the hash table again.

In the toy example, this leads to a speed boost of > 5x on subsequent queries (even when each query has a different date restriction).

Note that we would still profit from the FireCache if restrictions on the customer table were added – here we would be able to reuse the hash table of the lower Join. To further extend the utility of the cache, the optimizer could decide to pull such filters up to create larger common subplans across consecutive queries. This is something we plan to look into in the future.
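
To make the first point concrete, here is an illustrative variant with a made-up restriction on customer; it only changes the probe side of the lower Join, so the cached hash table built over nation can still be reused:

    SELECT n_name,
           SUM(o_totalprice) AS total_price,
           COUNT(*)          AS num_orders
    FROM orders
    JOIN customer ON o_custkey = c_custkey
    JOIN nation   ON c_nationkey = n_nationkey
    WHERE c_mktsegment = 'BUILDING'  -- illustrative filter on the customer table
    GROUP BY n_name;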

Avoiding Thrashing

The example also already shows one potential (and common) challenge with such caches: how to handle the possibly many useless items being added to the cache. Say hundreds of such queries come in per second, but each with a different date restriction. The full results added to the cache may then never be used again. There is a danger of the cache being thrashed, with useful items being evicted by these useless ones. In the “Better Eviction Strategy” section below we show how we improved on the well-known LRU eviction strategy to make such thrashing less likely.

“Semantic” Cache-ids and their Impact on the Full-Result Cache

In order to add and retrieve items from a cache, one needs a way of deriving keys / ids to identify them. In our case, for each MaybeCache and Join operator the corresponding cache-ids are determined from the subplan below it. This is described in detail in the next section.

Before we dive into this, let us have a look at why this approach can be beneficial already for a full-result cache alone – compared to simply deriving cache-ids from the query text, as is commonly done for such caches: since we are operating on the query plan, the cache-ids are “semantic” as opposed to merely text-based. Different query strings may be semantically equivalent and result in the same plan and hence the same cache-ids. For example, by the time the plan is built, all comments have been removed. Ignoring comments is important for BI tools such as Looker, which add metadata as comments to the queries they send out.
Subsequent queries may be exactly the same, except for the metadata added in comments. The query string changes on each run, but for the FireCache this still results in the same cache-ids and therefore cache hits.
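
As a sketch (the comment format is purely illustrative – BI tools differ in the metadata they attach):

    -- First run:
    SELECT n_name, COUNT(*)
    FROM customer JOIN nation ON c_nationkey = n_nationkey
    GROUP BY n_name;

    -- Later run from a BI tool, identical up to the comment:
    -- Context: {"user_id": 42, "dashboard": "customer_overview"}
    SELECT n_name, COUNT(*)
    FROM customer JOIN nation ON c_nationkey = n_nationkey
    GROUP BY n_name;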

Moreover, the cache-ids are computed after the plan has been fully optimized. Rewrites such as constant folding may normalize “different-looking queries” to the same plan.
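
A small illustration with made-up constants: the following two predicates differ as text, but after constant folding they yield identical plan fragments and hence identical cache-ids.

    WHERE o_totalprice > 100 * 1000
    WHERE o_totalprice > 100000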

We can even go a step further (as of this writing this is work-in-progress) and push the MaybeCache operator below any operators at the top of the plan which do not reduce the cardinality. Note that since we store only relatively small subresults, there is very little (or even no) noticeable overhead when evaluating that final operator on a small subresult fetched from the cache. Why is this useful? Let us look once more at our toy example query. After running it once, we may decide that we are actually interested in the average instead of the total price, i.e.:
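
Sketched as SQL (again an illustrative formulation):

    SELECT n_name,
           AVG(o_totalprice) AS avg_price,
           COUNT(*)          AS num_orders
    FROM orders
    JOIN customer ON o_custkey = c_custkey
    JOIN nation   ON c_nationkey = n_nationkey
    GROUP BY n_name;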

This results in the following plan. The final Projection preserves the subresult cardinality, therefore the MaybeCache can be pushed below it.

Even though the query has changed significantly, we still get a cache-hit and the bulk of the plan is not executed. Say, we then notice that we actually wanted to order by the average cost. The Order operator is cardinality preserving as well, so the MaybeCache is pushed below it (and below the Projection). Therefore, we again get a cache hit!

Caching in Distributed Engines

In distributed engines with many servers, each server has its own, independent FireCache. For system simplicity, there is no cross-server synchronization on whether or not to use the FireCache for individual subplans. Individual servers may or may not have cached subresults available; thus, some servers may answer a subplan from the cache while others compute it from scratch. Such distributed settings bring interesting challenges, described in the section “Consistent Subresults Matter in Distributed Settings” below.

Cache-ids for Arbitrary Parts of a Query Plan

As a well-known joke goes, there are only 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. Let us now look into the first problem in the context of the FireCache. We tackle it by deriving “cache-ids” for subplans which capture the “full details” of the represented subplan, including which data is scanned. By construction of these cache-ids, we never fetch invalidated data from the cache.

To compute a cache-id which represents a subplan (say the one below the MaybeCache in the previous figure above), we start at the leaves (here the three Scan nodes) and then propagate “subplan fingerprints” up through the plan.

A subplan fingerprint needs to represent exactly the data streamed out of that subplan. The slightest change to the subplan or the underlying data, e.g., a new row added to the orders table, a new or changed filter, a different aggregation, etc., must result in a different fingerprint.

For each operator we implemented a specific rule of how to combine the incoming fingerprints from its children with any parameters / configuration the operator may have (e.g., a complex, nested expression to be evaluated for a Filter). 

We will skip over the details here and will only briefly touch on how this is implemented for the Scan operator. In Firebolt’s managed storage, data scanned is split into immutable tablets. If data within an existing tablet is modified, a new tablet is created, reflecting the change (note that the change may be a minimal extension, e.g., reusing all files from the existing tablet, simply adding a delete vector – for our purposes here it is important though that each tablet is immutable). The Scan operator is thus configured with exactly which tablets it should scan on which server. Usually, this is a subset of the tablets a table is composed of (e.g., narrowed down by primary-key filters or distribution of tablets across servers). So to compute the fingerprint of a Scan operator, we simply hash all tablet-ids and combine these hashes with hashes of the scanned variables. 
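
Schematically, and glossing over the details, the rules can be summarized as follows (a sketch, with H denoting the hash function used for fingerprints):

    \text{fp}(\text{Scan}) = H(\text{tablet ids},\ \text{scanned variables})
    \text{fp}(\text{Op}) = H(\text{fp}(\text{child}_1), \ldots, \text{fp}(\text{child}_k),\ \text{params}(\text{Op}))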

By construction we therefore avoid ever serving “stale” cached results for a query! To illustrate this, let us look at the example plan above once more. Say, from one run to the next nothing changed except that a single row has been added to the orders table. This results in a different set of tablets to be scanned in the leftmost Scan node – which in turn gives a different fingerprint for that node. Since the fingerprint is then propagated up through the plan, the fingerprint (i.e., cache-id) used for the MaybeCache at the top also changes => We do not use the previously cached, now stale subresult. Instead a new one is computed and stored in the cache.
Note that since nothing else in the plan has changed, both Joins actually still receive the same cache-ids for their hash-table subresults as before => These can still be reused. We hope this gives an idea of how the propagation of fingerprints nicely leads to cache-ids which enable reuse in unchanged parts of the plan while never serving stale data.

To make hash collisions of the fingerprints extremely unlikely, we use the 128-bit SipHash function. To date, there are no known collisions of such SipHashes.

Consistent Subresults Matter in Distributed Settings

After so-called Shuffle operators (which distribute data across servers), one needs to be careful not to cache & reuse possibly inconsistent subresults. This is relevant for SQL queries which may give non-deterministic results on subsequent runs, even if the underlying data did not change. As an example, the following query returns 3 nations from the underlying table, but it is not defined which nations.
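
For instance (an illustrative formulation):

    SELECT n_name
    FROM nation
    LIMIT 3;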

Similarly, due to floating-point rounding, SUM(o_totalprice) may give (slightly) different results depending on the order in which the underlying double values are summed up.

Let us look at a made-up example to give an idea of why it can be problematic to cache non-deterministic subresults on different servers. Say we execute the following query on an engine with two servers.
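
A sketch of such a query (the exact formulation is illustrative), counting customers per nation for three arbitrarily picked nations:

    SELECT n_name, COUNT(*) AS num_customers
    FROM customer
    JOIN (SELECT n_nationkey, n_name FROM nation LIMIT 3) AS three_nations
      ON c_nationkey = n_nationkey
    GROUP BY n_name;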

Let us assume that the small 3-nations-subquery is executed on one server and the subresult is sent to both servers with a so-called “broadcast” Shuffle. Also, say that we split the evaluation of the join with customer across these two servers (each server taking care of roughly half of the customer table). We may pre-aggregate the count on these servers independently as well. In a final step the pre-aggregated subresults would be shuffled to one server and merged there to compute the final result: the count per nation for three arbitrarily selected nations.

If we are not careful, the subresults of the 3-nations-subquery may be cached independently on the two servers (note that, as mentioned above, we do not have any cross-server synchronization of the FireCaches). It could happen that we end up with two inconsistent copies of that subresult on the two servers – for instance, due to different evictions happening on these servers. E.g., we may end up with the three nations “GERMANY”, “ALGERIA”, and “CANADA” being cached on one server and “FRANCE”, “INDIA”, and “BRAZIL” on the other server. If we were to use these cached subresults for the two parts of the join, the final result of the query would contain 6 nations instead of the expected 3, and the reported counts would likely be off by ~50%! This is of course incorrect and needs to be avoided.

We therefore need to avoid caching possibly inconsistent subresults. To this end, we disable caching on nodes in the plan if these two conditions hold: 1) “separate copies” of a subresult may exist, i.e., post-shuffle, and 2) these subresults may be non-deterministic even on unchanged data. We compute this via a depth-first descent in the plan. In parts of the plan where there are parallel paths, we only allow caching of deterministic subresults.

Note that this does not mean we need to disable caching entirely above a Shuffle: if the “separate copies” are combined again into a single copy further up in the plan, from then on even caching non-deterministic subresults is OK again. In particular, the important case of caching full results is always OK, since these full results are always on a single server and there are no separate copies.

Better Eviction Strategy

As always with caches, there is a risk of “thrashing”: if many subresults are added to the cache in quick succession, important ones may be evicted by unimportant ones. I.e., subresults that are not reused in the future may cause others that are still useful to be thrown out of the cache once the memory limit has been exceeded. The FireCache is relatively large – we currently reserve 20% of RAM for it – and the subresults stored are in most cases rather small (there are different size thresholds, e.g., 1 MB for full results). Nevertheless, it is important to choose a good eviction strategy to maximize reuse and thus the processing time saved by the cache.

The most common strategy is the least-recently used (= LRU) eviction policy. It is easy to implement, fast, and intuitive, since the item not accessed for the longest period of time may be the one least likely to be accessed again in the near future.

In our case we have additional information readily available which could be, but is not, used by LRU: we know how long it took to compute a subresult and we know its size in RAM. Also, it would be nice to track how often an item was used – this may indicate how useful it will be in the future.

The “NEW” Strategy

We thus combined these in the following manner to form our NEW strategy. This is essentially the same as the policy described in the section "Benefit-Based Result Selection" in [NBV13].

  • As base-weight, we combine the processing time needed to compute a subresult (and to potentially save when reusing it later) with its size as:
    time in milliseconds / size in bytes
    In a sense this gives us “how much bang we get per buck”.
  • For the frequency component we essentially sum up the (decayed, see next bullet) base-weights per access. E.g., if there were 3 consecutive accesses close in time, the final weight of an item would be ~3x its base-weight.
  • Finally, to keep an LRU-like component, we decay the weight exponentially with respect to the current vs. the last access time. For this we take a “half-life time” of 5 minutes, i.e., after 5 minutes the weight of an item is decayed by half. We experimented with different half-life times and the differences were small, but 5 minutes seemed like a good tradeoff across different workloads. The resulting combined weight is sketched right after this list.
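
Putting the three components together, the weight of an item at time t can be written as follows – a sketch capturing the rules above (with the decay applied relative to each access time t_i), not the literal implementation:

    w(t) = \sum_{i} \frac{\text{time}_{\text{ms}}}{\text{size}_{\text{bytes}}} \cdot 2^{-(t - t_i)/H}, \qquad H = 5\ \text{minutes}

where the sum runs over the past accesses of the item at times t_i, and time in milliseconds / size in bytes is the base-weight from the first bullet.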

When the FireCache runs out of memory and needs to evict a subresult, it chooses the one with the smallest weight. Computing this is relatively expensive, but with some tuning we got it to run in 500 to 1000 nanoseconds per eviction, which is OK for such “coarse-grained” caches.

Performance in Simulations with Different Cache Sizes

We benchmarked NEW vs. LRU on real-world data of several customers and compared the total processing time saved by the two strategies. Below is the result for the workload of a selected customer over multiple days. In our simulation we varied the cache size and show the percentage of time saved by the cache:

The results for very small and very large caches are not so interesting: for the former nothing can be cached and for the latter essentially everything may be cached and then reused many times by these very regular workloads – very few exact duplicates, but a lot of reuse potential due to ~100 query patterns, each with large subplans occurring many times in consecutive queries.

For the interesting mid-range of cache-sizes (4GB to 18GB) we get up to 2x more processing-time savings with NEW compared to LRU, see also this figure which shows the ratio:

Impact

Caching Hash-Table Subresults

For the interesting case of customers with mostly unique queries (i.e., few exact duplicates), stemming from tens / hundreds of repetitive query patterns, we often see very nice speed boosts of 10-100x for individual queries. This is, e.g., the case for the customer mentioned above when discussing our NEW eviction strategy. The customer’s SQL queries are often > 100 lines long and contain many joins. But, consecutive queries from the same pattern change only in one literal of a crucial WHERE restriction. Note that in this case the savings come from reusing hash tables in Join operators previously stored in the FireCache.

The chart below shows the overall hash-table build-time saved vs. build-time spent. This is across all customers in production over several weeks this year. The thick blue line gives the ratio of saved / spent. It shows that for hash-table build times we overall reap 10x savings on many days with some spikes being even higher.

Full Result Cache

The MaybeCache added to the top of the plan has obvious advantages. As mentioned, many other systems add a similar cache (although, to the best of our knowledge, without the full benefits of the “semantic cache-ids” described above).

It is interesting to see how much individual customers may profit from such a full result cache. The answer is that some customers can profit tremendously: here is a chart with the simulated time savings of the top 15 customers (anonymized), ordered by the potential savings across ~5 days of production queries. I.e., customer “A” could save > 95% of their total compute time if queries were answered by a full result cache. Note: we will redo this measurement to report actual time savings once the full result cache has been in production for a longer period of time (it was launched only recently).

Conclusion

This blog post gave a deep dive on how at Firebolt we give our customers’ queries a speed boost by caching and reusing subresults. For some of the interactive use cases we serve, this is of crucial importance: without caching the results of complex subplans, certain query patterns would not achieve the low-millisecond latencies required.

As shown in the “Impact” section above, this also yields tremendous savings for our customers in general, e.g., saving up to 10x of the processing time spent on building hash tables for joins.

We hope we were able to give an idea of how, with the propagation of fingerprints and the derivation of cache-ids, we can determine in a fine-granular manner which parts of a plan can be reused and which need to be recomputed. We dove into the challenges posed by distributed settings. Finally, we gave a quick overview of a modified eviction policy that improves the usefulness of the FireCache under heavy load, when many items are being added (and, conversely, need to be evicted).

Since our cache is purely in memory for now, compact representations of data structures to store subresults are very beneficial. In a follow up blog post, we will describe in depth a custom, space-optimized hash table which we devised for our novel “FireHashJoin”. With it, we achieve more than 5x memory savings (as measured on production data) compared to the hash table implementation we used before. It thus enables us to cache significantly more of such subresults in the same amount of RAM.
