Let’s get straight to the point: benchmarks are a contentious topic. The results are full of promises and assurances, but simultaneously riddled with hidden decisions, biases, and qualifications in small text that should make a wise audience skeptical. It’s good to approach benchmarks with a healthy dose of distrust and skepticism: I’ve even written a blog about the reasons why I dislike them1. But that blog also concludes that they are a necessary evil. Performance matters a lot, and although investing many hours of your own time and effort to gauge the performance of every option up front would be the best way to get at the truth, few individuals or businesses have that much time. So here we are, investing that time for you, and here you are, interested in hearing how we did it.
Before getting into the weeds, there’s something worth stating up front: the queries, data, and results of the FireScale benchmark are all accessible in the Firebolt benchmarks repo on GitHub. If you’d rather get hands-on, ingest the data, run the queries, and look at all the results yourself, go right ahead. If you’d like some context before you embark on that task, there’s plenty to discuss here first.
This blog’s goal is to walk you through the decisions we made and the process we followed to get the FireScale benchmark results, along with the justifications for some of those decisions. I’ll also discuss the results themselves. If I do my job well, you’ll understand where we’re coming from and take our numbers with the correct dose of salt. By walking you through what we did, it’ll hopefully become clear that there were no niche optimizations or hyper-specific tweaks that deviate from the real world. Our goal from the start was to make something real, meaningful, and reflective of reality.
Introducing a New Benchmark
Your first question may be whether the world really needs another benchmark. Firebolt’s Head of Product Management, Hiren Patel, wrote a blog about the process and the intention behind creating FireScale back in September 20242. His blog details why existing benchmarks are unsatisfactory, why we set out to make our own, and the methodology that went into selecting our queries and data. So that you don’t have to read another blog as a prerequisite for this one, I’ll cover the highlights here, but give his post a look later if you’re curious and want more detail.
Other benchmarks don’t cut it
TPC-H and TPC-DS are widely considered to be the industry-standard benchmarks for SQL analytics, but TPC doesn’t do a great job of representing what data and queries look like in reality. The range of query complexity is relatively narrow, and it does an especially poor job of reflecting the more sophisticated queries that users have been crafting and submitting in the 20+ years since the TPC benchmarks were introduced. Amazon has written a very detailed breakdown of the discrepancies between TPC and reality3, based on the queries they see on Redshift, and we agree with their conclusions. That said, if you like TPC-H, we’ve provided results and queries for it in the benchmark repo.
The data and queries of FireScale
FireScale’s data is built on top of the Berkeley AMPLab Big Data Benchmark dataset at a scale of 1 TB. We also added three more dimension tables on top of it, as real-world workloads tend to have more than the two tables that exist in the default AMPLab dataset.
With this data, we devised queries based on the queries we saw Firebolt users running in production. We analyzed our customers’ workloads, from the simplest to the most complex queries, and we anonymized and modified the query patterns to run on the AMPLab data.
The goal of these two steps was to create a benchmark that truly reflected real-world behavior:
- These are real queries that are being run by real businesses, on a dataset that models web traffic in a way that is comparable to the real world.
- At a 1 TB scale for the largest table in the dataset, it’s large enough to be a meaningful test for most systems, but not so large that it becomes difficult for an individual to run and validate the results we generated.
We intend to follow up this initial effort with FireScale at 100 GB and 10 TB scales, so stay tuned for that.
Setting up FireScale
Deciding how to set up, run, and present results of a benchmark involves a ton of questions where there isn’t always an obvious answer.
Selecting vendors
We set out to provide a benchmark for cloud data warehouses. This meant that the obvious candidates were Snowflake, Redshift, and BigQuery, as they’re the three largest cloud data warehouses on the market. We nobly chose to include ourselves, too. We didn’t intentionally exclude any other vendors; limiting the initial effort to a handful was simply necessary to keep the scope reasonable. If you’re reading this and desperately want another cloud data warehouse benchmarked on FireScale, you may be able to modify the queries and do it yourself; otherwise, reach out.
Loading data
The source data sits in an S3 bucket: s3://firebolt-benchmarks-requester-pays-<region>/firenewt/1tb/, with all five tables stored as Parquet files in their own folders. We ingested the data into every vendor from that bucket, setting up the same indexes on the same tables for each vendor. In Snowflake and BigQuery, this takes the form of clustering keys. In Redshift, it’s sort keys. In Firebolt, we just call them primary indexes. Ultimately, the result is the same: data in storage is sorted by the same columns on the same tables.
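To make that concrete, here is a minimal sketch of how the “same sort order everywhere” idea maps to each vendor’s DDL. The table and column names (uservisits sorted by a visit date column) are illustrative assumptions; the exact definitions live in the benchmark repo.

# How the "same sort order everywhere" idea maps to each vendor's DDL.
# Table and column names are illustrative; the real scripts are in the repo.
SORTED_STORAGE_DDL = {
    "snowflake": "ALTER TABLE uservisits CLUSTER BY (visitdate);",
    "bigquery": "CREATE TABLE mydataset.uservisits CLUSTER BY visitdate AS SELECT * FROM mydataset.uservisits_staging;",
    "redshift": "ALTER TABLE uservisits ALTER SORTKEY (visitdate);",
    "firebolt": "CREATE TABLE uservisits ( /* columns */ ) PRIMARY INDEX visitdate;",
}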
We also manually ran VACUUM operations where we could, but since some table optimizations happen in the background on other vendors, we also let the data sit for a while so any background optimizations would have time to run. We didn’t wait a deliberate amount of time, but the final runs we’re publishing results from happened at least a month after the data was ingested, so anything happening in the background certainly had time to happen.
Single queries vs concurrent workloads
Measuring how fast a single query runs is often the most scientific way to gauge the performance of a system. You control for as many factors as possible, give all the resources of the system to that query, and you get the truest understanding of how fast that query runs in a vacuum. But in the real world, most systems aren’t running one query at a time. You have multiple users submitting multiple different queries, and how that’s handled matters more for tracking performance than the idealized scenario of one query at a time.
Which of these paradigms do we try to model when creating and publishing a new benchmark? Well, the answer’s already out there, and TPC does it: you do both. We split the benchmarking and analysis into power runs and concurrent runs - one for sequentially running each query, the other for running many queries simultaneously. With power runs, you can measure price-performance by evaluating how long it takes a vendor to run through a defined set of queries. With concurrent runs, you can measure price-performance by seeing how many queries it can churn through in a set amount of time with a constant load. They both have their merits, and so we chose to do both.
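To make the two price-performance framings concrete, here is a minimal sketch of the arithmetic with entirely made-up numbers: for a power run, cost is the elapsed wall-clock time multiplied by the configuration’s hourly rate; for a concurrent run, throughput can be normalized by that rate.

# The two price-performance framings, with made-up numbers.
hourly_rate_usd = 4.00                 # hypothetical hourly price of one configuration

# Power run: a fixed set of queries, so cost = elapsed time * hourly rate.
power_run_seconds = 180.0
power_run_cost = hourly_rate_usd * power_run_seconds / 3600    # = $0.20

# Concurrent run: constant load, so normalize throughput by price.
queries_per_second = 250.0
qps_per_dollar_hour = queries_per_second / hourly_rate_usd     # = 62.5 QPS per $/hour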
More technical considerations
One essential consideration before getting started was that we needed to run the benchmark from the region where we’d set up all of our cloud providers. Every vendor account had been set up in Ashburn, Virginia: us-east-1 on AWS, and us-east4 for BigQuery, which is quite literally across the street. Our initial test runs involved submitting queries and measuring end-to-end query duration from a laptop in Kirkland, Washington, where it became clear that for the faster queries in FireScale, we were mostly testing a WiFi connection and how fast data could travel back and forth across the continental United States. We set up our formal benchmark runner on an AWS EC2 instance in the us-east-1 region, which shrank the >60ms cross-country WiFi + network delay into microseconds that wouldn’t impact the results.
From there, we could connect our client on that EC2 instance to the relevant vendors and use it to submit queries. For Snowflake, BigQuery, and Firebolt, this came in the form of the vendor-provided connectors/drivers/SDKs for Python and Node.js. For Redshift, this was psycopg2 and pg, because the AWS-provided Redshift clients are built for managing clusters, but not for submitting and running queries.
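As a sketch of what that looked like on the Redshift path (the connection details below are placeholders, not our real cluster), the client simply wraps each query in a wall-clock timer to capture end-to-end duration, including fetching the results:

import time
import psycopg2  # Redshift speaks the Postgres wire protocol

# Placeholder connection details; the real runner was an EC2 instance in us-east-1.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="firescale", user="benchmark", password="...",
)

def run_timed(sql):
    # Submit a query and return its end-to-end duration in seconds,
    # including pulling the results back over the network.
    start = time.perf_counter()
    with conn.cursor() as cur:
        cur.execute(sql)
        cur.fetchall()
    return time.perf_counter() - start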
Before we started running FireScale queries, we also wanted to make sure that the engine was running and warmed up with the dataset. This came in the form of calling a hash function on all the data in each table… except for the uservisits table, because trying to hash all columns in it on Redshift either errored out or took hours. In the interest of fairness and ensuring that each vendor was being treated equally, we chose to exclude that warmup on Firebolt and Snowflake, too. We also didn’t do any warmup queries on BigQuery at all, as it doesn’t have any caching mechanisms besides a pure result cache, so it wouldn’t accomplish anything.
Redshift
While on the topic of Redshift’s lack of official Python/JS query clients and the limitations that imposed during warmup, it’s worth mentioning that setting up, managing, and running queries on Redshift consumed a disproportionate amount of the time and effort that went into FireScale. With no auto-start and auto-stop functionality, it took a lot of manual management, and its cluster start times meant waiting 5-10+ minutes to get a run started. Where we could run hash_agg(*) in Snowflake and Firebolt to warm up the caches prior to benchmark runs, Redshift required a less-performant cascade of:
SELECT FNV_HASH(col1, FNV_HASH(col2, FNV_HASH(...))) FROM table;
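Nesting that expression by hand gets tedious for wide tables, so here is a small sketch of how one might generate it from a column list; the column and table names are placeholders.

# Build Redshift's nested FNV_HASH warmup expression from a column list.
# Column names here are placeholders; on Snowflake and Firebolt the equivalent
# warmup was simply SELECT hash_agg(*) FROM table.
def fnv_hash_cascade(columns):
    expr = f"FNV_HASH({columns[-1]})"
    for col in reversed(columns[:-1]):
        expr = f"FNV_HASH({col}, {expr})"
    return expr

print(f"SELECT {fnv_hash_cascade(['col1', 'col2', 'col3'])} FROM some_table;")
# SELECT FNV_HASH(col1, FNV_HASH(col2, FNV_HASH(col3))) FROM some_table;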
Testing different Redshift node types and node counts required either resizing our cluster (slow) or re-ingesting our data into a new cluster each time (very slow). We could have used Redshift data sharing, but that required keeping the original cluster online (more expensive) and came with no clear assurance that performance would be equivalent to having the data in the cluster itself. Taking unnecessary steps that could limit performance is a very bad thing to do when running a benchmark, so to be safe we avoided it and accepted the consequence of waiting 5-10 minutes for each cluster resize. Except that, on occasion, a resize failed because nodes were unavailable, or a snapshot of our cluster hadn’t been made yet, or we realized that for some reason, the particular node type we wanted 1, 2, 4, and 8 of was only available in clusters of exactly 3, 4, 6, or 12 nodes.
BigQuery, and the issue with benchmarking serverless systems
BigQuery also comes with a big caveat, because BigQuery is serverless.
For those unaware, serverless billing and infrastructure models are radically different from the start-and-stop, uptime-based models of the other three vendors we were testing. They often come with some big benefits:
- There are no start or stop times, everything can run immediately
- You have no hourly or rate-based charges
- There are no cluster or server management logistics to worry about
- You’re only billed for the exact amount of resources you use as you use them
This is incredibly user-friendly. But everything is a tradeoff, and because serverless providers need to turn a profit, there are some key downsides:
- Much higher cost when in use
- More complexity and opaqueness when trying to determine cost up front, especially on the scope of days, weeks, or months
Pay-as-you-go serverless pricing models are excellent when you expect very low usage. The steep rates you pay don’t matter when you’re barely using them in the first place. But the more you use them, the worse they are. In a benchmark, you’re measuring a scenario where the system is running 100% of the time, so a serverless pricing model is going to look much worse. The performance results are still meaningful, but the price-performance ratios end up disproportionate.
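To illustrate why utilization flips the comparison, here is a toy calculation with invented rates; these are not any vendor’s actual prices, just arithmetic to show the effect.

# Toy comparison of serverless vs. provisioned billing; every rate is invented.
serverless_cost_per_query = 0.05     # hypothetical per-query charge
provisioned_hourly_rate = 8.00       # hypothetical always-on cluster price per hour

light_usage = 50 * serverless_cost_per_query        # 50 queries/hour -> $2.50, cheaper than $8.00
benchmark_load = 5000 * serverless_cost_per_query   # 5,000 queries/hour -> $250.00, far more than $8.00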
Running Power Runs
Now that all of that is out of the way, we can start running the benchmark.
For power runs, where we run each query sequentially, we used a Python client that measured end-to-end query duration. The client disabled query result caching for each vendor, ran the warmup as outlined earlier, then ran each of the 25 queries 5 times, for a total of 125 queries.
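Conceptually, the power run loop looks something like the sketch below. The execute and run_timed callables stand in for the vendor-specific helpers (run_timed appeared in the earlier connection sketch), and the result-cache statements shown are the Snowflake and Redshift session settings; this is an outline of the approach, not the exact client code from the repo.

# Outline of a power run: turn off the result cache, then run each query
# 5 times and record end-to-end latencies. queries maps query name -> SQL text.
RESULT_CACHE_OFF = {
    "snowflake": "ALTER SESSION SET USE_CACHED_RESULT = FALSE;",
    "redshift": "SET enable_result_cache_for_session TO off;",
    # Firebolt and BigQuery expose equivalent switches via settings/job config.
}

def power_run(execute, run_timed, queries, vendor, runs_per_query=5):
    execute(RESULT_CACHE_OFF[vendor])
    latencies = {}
    for name, sql in queries.items():
        latencies[name] = [run_timed(sql) for _ in range(runs_per_query)]
    return latencies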
When doing analysis on power run data, we chose to look at the median query latency of runs 2-5 for each query, aka the warm runs. While the first run, aka the cold run, is a meaningful datapoint, a truly cold query is not something that happens very often in a production environment. Every query in FireScale reads from the uservisits table, which we didn’t touch at all in the warmup, so the cold run is fully cold. By contrast, for Firebolt’s users, 97% of all queries read data from a cache in some way, accessing data that has already been accessed while the engine has been running. We wanted to emulate that.
By disabling the result cache, you’re ensuring that the query engine still has to do work to determine query results, but by using the warm runs of each query, you’re representing the scenario in which the cache is warm and the data does not have to be read cold over the network. This approach benefitted Snowflake, Firebolt, and Redshift, though it did not benefit BigQuery, as it does not have any caches to warm up.
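The analysis step is then just a median over the warm runs. A minimal version, assuming the latencies dictionary from the sketch above:

from statistics import median

def warm_medians(latencies):
    # latencies maps query name -> [run1, ..., run5] in seconds; drop the cold
    # first run and take the median of the four warm runs.
    return {name: median(runs[1:]) for name, runs in latencies.items()}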
To test further configurations of each system, we shut down the engine/warehouse/cluster, resized as needed to test various sizes and configurations, booted it back up, and restarted the process again.
Of course, if you’re interested in looking at all of the data, you can find that in our benchmark repo on GitHub.
Running Concurrent Runs
Testing concurrency is a little trickier. Our initial naive approach was to run as many threads as desired, with each thread randomly iterating through the 25 queries, using the same Python setup we’d used for power runs. Both of those choices, the query set and the client, turned out to be bad ideas.
For the queries, the issue is that it just doesn’t make much sense to run the same 25 queries over and over again, nor does that approach reflect reality. The usage we’ve seen in the real world on Firebolt is that while there are complex queries, the vast majority of queries submitted in highly-concurrent scenarios are simple, lightweight, and extremely quick. FireScale has about 10 queries that might meet that lightweight criteria, while the other 15 are more meaningfully complex. Beyond that, submitting the exact same handful of queries thousands or tens of thousands of times isn’t how the real world works. We narrowed things down with a vendor-agnostic analysis: we added up each query’s duration across vendors and selected the five fastest query patterns. We generated 10,000 variations of each of those five patterns, and then we ran the concurrency benchmark by randomly working our way through all 50,000 of those queries. (We actually generated 10,000 variations of all 25 queries; a planned future improvement to FireScale is to assign realistic weights to each query so that the more complex queries are sprinkled in at a ratio that makes sense.)
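Generating the variations is conceptually simple: take a query pattern and substitute different literal values into it. Here is a toy sketch; the template, column names, and parameter ranges are invented for illustration, and the real templates live in the benchmark repo.

import random

# Toy generation of query variations. The template, column names, and value
# ranges below are invented; the real FireScale templates are in the repo.
TEMPLATE = ("SELECT COUNT(*) FROM uservisits "
            "WHERE visitdate BETWEEN '{start}' AND '{end}' AND countrycode = '{code}';")

def generate_variations(n=10_000, seed=42):
    rng = random.Random(seed)  # seeded so every vendor sees the same 10,000 queries
    codes = ["US", "DE", "IN", "BR", "JP"]
    queries = []
    for _ in range(n):
        year = rng.randint(2000, 2012)
        queries.append(TEMPLATE.format(start=f"{year}-01-01", end=f"{year}-12-31",
                                       code=rng.choice(codes)))
    return queries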
For the client being used to benchmark concurrency, Python was simply too slow to saturate scaled-up engines in Firebolt and Snowflake. We pivoted to Grafana K6 and the Node.js clients to get better performance and more robust multi-threading support, and then we were off to the races. The runs were done with a 5-minute warmup and then a 1-minute run to actually measure queries per second (QPS). That sounds brief, but when most configurations we tested were operating north of 100 QPS and a few were above 1000, we’re talking about a sample of thousands or tens of thousands of queries in that 1 minute.
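The QPS figure itself is straightforward arithmetic: count every query that completes inside the 1-minute measurement window after the warmup and divide by the window length. A rough sketch of that calculation:

# Rough sketch of the QPS calculation: completion_times holds a timestamp for
# every finished query; only those inside the 1-minute window after warmup count.
def measure_qps(completion_times, window_start, window_seconds=60):
    in_window = [t for t in completion_times
                 if window_start <= t < window_start + window_seconds]
    return len(in_window) / window_seconds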
We also chose not to benchmark concurrency on BigQuery, because its serverless model would mean that both performance and price-performance comparisons for concurrent workloads wouldn’t make much sense. Because you pay a fixed price for each query and receive more resources from BigQuery as you run more queries, running 10 queries at once would see the price multiplied by 10. If 100% uptime was unfair to a serverless pricing model, try 1000% uptime. The queries would also have no impact on each other’s performance, limiting the insight one could gain by running them concurrently.
As with the power runs, you can find the full concurrency run data in our benchmark repo on GitHub.
Results and Analysis
With the benchmark run, the numbers dumped into spreadsheets and uploaded to GitHub, and the full methodology laid out, the only thing left to do is talk about the results. We’ve shared some flashy headlines and drawn up comparisons that put one specific configuration next to another. With the full results and a couple of cool graphs, you should be able to pick the comparisons that make the most sense to you.
Power Run Results
You’ll see us promoting the conclusion that in order to match Firebolt’s performance on an engine with a single S node, you’d have to spend 37x more in Snowflake on an XL warehouse. The simple truth is that none of the platforms we tested can match Firebolt’s performance. Firebolt is the fastest, and by a significant margin. But let’s look at a few more practical comparisons.
If you’re cost-conscious, you’d use a Snowflake XS warehouse or a Firebolt engine with an S node. If you’re performance-conscious, you’d use a Firebolt engine with XL nodes or an L/XL Snowflake warehouse. If you want to strike a balance between performance and price-performance, you might look at a Snowflake M warehouse and a Firebolt M node type. Because Firebolt is both faster and less expensive, all of these comparisons are favorable for it, but you can see how these different priorities and comparisons adjust the potential conclusions.
Rather than continue to draw up different comparisons, I’ll present the graph you saw at the top of this blog, which covers most of the configurations we did power runs on:
[Chart: power run price vs. performance across the Firebolt, Snowflake, and Redshift configurations we tested]
BigQuery, absent from this chart, is directly down from the Snowflake XS datapoint, but falls so far off the bottom that it would be somewhere in the middle of the next paragraph.
What this chart is telling you is that every single Firebolt configuration can run this benchmark faster and at a lower price than the other vendors we tested. You can see a relatively clean scaling curve for each vendor, where adding more resources speeds things up to a point, and because running faster means saving on cost when there’s a set amount of work to do, price doesn’t jump up too quickly. You also see diminishing returns for each vendor as you continue to add resources, because for some of the queries in FireScale, the extra resources can’t be leveraged and simply become wasted capacity on the simpler queries.
Concurrent Run Results
For concurrency, I’ll start by presenting a similar but much less pretty chart, then I’ll explain it:
[Chart: queries per second vs. cost as additional clusters or nodes are added to Firebolt, Snowflake, and Redshift]
This represents how much QPS you can get by adding additional capacity to Snowflake, Firebolt, and Redshift. For Firebolt, this demonstrates the power of adding more clusters to a single M node, compute-optimized engine. For Snowflake, this shows adding more clusters to an S warehouse. For Redshift, it’s adding more nodes to an ra3.4xlarge cluster.
You can clearly see Redshift struggling to scale past a certain QPS threshold, while both Snowflake and Firebolt are able to scale near-linearly by adding more clusters. However, Firebolt’s price and performance advantage from the start means that it can scale to much higher query throughput for a lower cost.
Conclusion
We tried our best to do things right, reflect the real world, and create and run a benchmark that can be trusted and validated. The end result from FireScale that we see is that Firebolt is faster, less expensive, and overall more powerful than the competition.
But there’s a huge disclaimer that still needs to be made: your data and your queries are not this data and these queries. You can adapt our code in the benchmark repo to add your own benchmark, and then use the same client to run and test it. In a lot of scenarios, we’re confident the results will be similar. Firebolt is unprecedentedly fast compared to other cloud data warehouses, and we’re excited to get it into your hands.
Sign up today, get your $200 in free credit, and give it a spin. Maybe go run FireScale yourself.
1 https://www.linkedin.com/pulse/problem-benchmarks-cole-bowden-vfcvc/
2 https://www.firebolt.io/blog/high-efficiency-and-low-cost-concurrency-in-action#benchmark-methodology
3 https://www.amazon.science/publications/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet