Almost all cloud data warehouse vendors claim to be the fastest. The thing is, seamless, high-performance experiences tend to come with a trade-off that no one likes to talk about: cost. Understanding cloud data warehouse costs is not straightforward, because they depend on several parameters, including the data, the queries, user concurrency and performance requirements. In the following article we’ll review the most common pricing models, their pros and cons, and the use cases they are best suited for.
The ‘Pay Per TB Scanned’ model charges for storage plus the amount of data each query scans, which means pricing depends heavily on usage and the size of your workload.
For storage, BigQuery charges $0.02 per GB per month. In Athena, data is stored in S3, where users pay roughly the same. On top of that, on both platforms users pay $5 for each TB scanned while executing queries.
In the case of BigQuery, you can move to a flat rate of $10,000 per month (or $8,500/month if billed annually) for every 500 slots, which are essentially compute units. This brings BigQuery down towards Snowflake and Redshift prices for regular-use analytics.
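To make the math concrete, here is a minimal sketch of how a monthly bill under this model adds up, and roughly where the flat-rate plan starts to pay off. The workload figures (10 TB stored, 300 TB scanned per month) are hypothetical, and the break-even point ignores storage, which is billed either way.

```python
# Rough cost sketch for the pay-per-TB-scanned model, using the list prices
# quoted above. The workload numbers are hypothetical placeholders.

STORAGE_PER_GB_MONTH = 0.02   # BigQuery active storage; S3 standard is similar
PER_TB_SCANNED = 5.00         # on-demand query pricing (BigQuery and Athena)
FLAT_RATE_500_SLOTS = 10_000  # BigQuery flat rate, 500 slots, monthly billing

def on_demand_monthly_cost(stored_tb: float, scanned_tb_per_month: float) -> float:
    """Storage plus scan charges for one month."""
    storage = stored_tb * 1024 * STORAGE_PER_GB_MONTH
    scans = scanned_tb_per_month * PER_TB_SCANNED
    return storage + scans

# Hypothetical workload: 10 TB stored, 300 TB scanned per month.
cost = on_demand_monthly_cost(stored_tb=10, scanned_tb_per_month=300)
print(f"on-demand: ${cost:,.0f}/month")  # ~ $1,705

# Break-even point where flat-rate slots become cheaper than on-demand scans.
breakeven_tb = FLAT_RATE_500_SLOTS / PER_TB_SCANNED
print(f"flat rate pays off above ~{breakeven_tb:,.0f} TB scanned per month")  # 2,000 TB
```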
Per-TB-scanned billing means users really do pay only for what they use at that moment, and not for any idle time. If your analytics are more “one-off” than regular daily tasks, this model is a great option.
If you are just getting started, Athena can be a great entry point for running queries against an existing data lake. It is also very user-friendly, which can save costs spent on resources and developers. That is, of course, until your needs grow.
This pricing model can make users very cautious before running a query, because a single query might cost too much. Almost every organization has encountered the same painful scenario - a user makes the grave mistake of running a query over a massive data set, which results in a big fat dollar amount for that single query. This usually leads to throttling that prevents people from asking “expensive” questions, or requires them to ask permission before running large queries. But throttling defeats one of the main goals of the analytics infrastructure: making analytics more accessible.
Users should also be aware of differences in how each platform calculates the TB-scanned metric. BigQuery bills on the uncompressed size of the data types in each column a query reads, not on the compressed data scanned. Athena, on the other hand, bills on the compressed data scanned, which is more cost efficient. The point is, the devil is in the details.
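As a rough illustration of the difference, consider the same query reading 1 TB of logical data on each platform. The 4:1 compression ratio below is an assumption for the sake of the example, not a measured figure.

```python
# Illustration of why the TB-scanned definition matters. The data size and
# compression ratio below are hypothetical assumptions, not vendor benchmarks.

PER_TB_SCANNED = 5.00   # $ per TB on both platforms
logical_tb = 1.0        # uncompressed data read by the query
compression_ratio = 4   # assume columnar data compresses ~4:1

cost_uncompressed_billing = logical_tb * PER_TB_SCANNED                        # BigQuery-style
cost_compressed_billing = (logical_tb / compression_ratio) * PER_TB_SCANNED    # Athena-style

print(f"billed on uncompressed bytes: ${cost_uncompressed_billing:.2f}")  # $5.00
print(f"billed on compressed bytes:   ${cost_compressed_billing:.2f}")    # $1.25
```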
A customer who was averaging $1,800/month with this model saw a jump to $21K+/month, all because the analysts unleashed a bunch of experimental queries (at $5 per TB, that works out to roughly 4 PB scanned in a single month). This model introduces a lot of unpredictability.
Bottom line: while this pricing model can work very well for infrequent analytics against small to mid-sized data sets, it can become the most expensive option for repetitive big-data analytics where a lot of data is scanned frequently. If you have lots of data and many users performing analytics, this may be the worst model for you.
Both Snowflake and Redshift offer on-demand pricing models as well as 30%-70% discounts for prepaying. The cost of these platforms depends on how much you use them, performance requirements and dataset sizes.
Redshift charges per hour, per node, for compute. With DC2 nodes storage is coupled to compute, while with RA3 nodes storage is offloaded to S3. To calculate costs, multiply the price per hour for the selected node type by the cluster size and the number of utilized hours: [price per hour] x [cluster size] x [utilized hours]. With RA3 nodes, Redshift managed storage is charged at S3 rates. Keep in mind that a cluster can only be built from a single node type, which makes it hard to adjust to dynamic use cases. Redshift also offers only a limited number of instance types, so switching between node types results in a step jump in costs.
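The formula is simple enough to put into code. The node price and cluster shape below are illustrative placeholders; real prices depend on node type and region, so check the current AWS price list.

```python
# The Redshift cost formula from above:
# [price per hour] x [cluster size] x [utilized hours].
# The node price and cluster shape here are illustrative placeholders.

def redshift_compute_cost(price_per_node_hour: float,
                          cluster_size: int,
                          utilized_hours: float) -> float:
    return price_per_node_hour * cluster_size * utilized_hours

# Hypothetical example: a 4-node cluster at $3.26/node-hour, running 12 hours a day.
monthly = redshift_compute_cost(price_per_node_hour=3.26,
                                cluster_size=4,
                                utilized_hours=12 * 30)
print(f"compute: ${monthly:,.0f}/month")  # ~ $4,694
# RA3 managed storage is billed separately at S3-like per-GB-month rates.
```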
Snowflake decouples storage and compute and charges for them separately. Compute pricing has 4 different tiers - Standard ($2), Enterprise ($3), Business-Critical ($4) and Virtual Private Snowflake (undisclosed but more expensive) - per credit, where one node running for one hour consumes one credit. While the tiered model might sound simple, you might end up on a much higher tier just to get a single feature you need. For example, to add materialized views you will need to adopt the Enterprise tier; if you need private link capabilities, you will need to move to the Business-Critical tier. For compute, you can choose from virtual warehouse (cluster) sizes of 1, 2, 4, 8, 16, 32, 64, 128, 256 or 512 nodes, and doubling the warehouse size doubles the cost. The largest warehouse comes to $2,048 per hour ($4 x 512 nodes on Business-Critical), which is around $18 million annually if run 24x7. Actual charges are billed per second.
Storage pricing is $23 per terabyte per month with an annual commitment, or $40 per terabyte per month on demand. Even though we chose to put Snowflake and Redshift in the same category, there are critical differences between them. While a benchmark might show that Redshift is cheaper than Snowflake, Snowflake is easier to manage. A true cost comparison needs to take into account your actual data size, compute usage and administrative costs.
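Putting the compute and storage rates together, here is a rough sketch of what a monthly Snowflake bill might look like. The warehouse size, running hours and data volume are assumptions for illustration, and an actual bill depends on per-second usage and auto-suspend behavior.

```python
# Rough Snowflake cost sketch based on the tiered prices above. The warehouse
# size, hours and storage volume are hypothetical inputs for illustration.

CREDIT_PRICE = {"standard": 2.0, "enterprise": 3.0, "business_critical": 4.0}
STORAGE_PER_TB_MONTH = {"capacity": 23.0, "on_demand": 40.0}

def snowflake_monthly_cost(tier: str, warehouse_nodes: int, hours_per_month: float,
                           stored_tb: float, storage_plan: str = "capacity") -> float:
    # A warehouse consumes one credit per node per hour (billed per second).
    compute = CREDIT_PRICE[tier] * warehouse_nodes * hours_per_month
    storage = STORAGE_PER_TB_MONTH[storage_plan] * stored_tb
    return compute + storage

# Hypothetical: an 8-node (Large) warehouse on Enterprise, 10 hours/day, 20 TB stored.
print(f"${snowflake_monthly_cost('enterprise', 8, 10 * 30, 20):,.0f}/month")  # ~ $7,660
# Doubling the warehouse size doubles the compute line, which is the "credit fever" trap.
```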
When working with relatively small workloads and manageable usage, this flexible pricing model can be very cost effective. By monitoring your performance and usage needs, you can turn compute resources on and off to control costs. Another advantage of Snowflake is its ease of use. While Snowflake is not considered a cheap platform, it provides a friendly user experience which, in the long run, can save costs spent on manpower and resources.
Working with large data sets, complex queries or high user concurrency brings out all the deficiencies of most cloud data warehouses. While most cloud data warehouses help simplify administration and scalability, they are not good at improving performance or costs as you scale. The simple answer to improving performance has been to scale compute, which makes more money for the vendor. There aren’t enough options to tune for greater efficiency, or choose different node types and numbers for different workloads.
In the case of Snowflake, for example, you can only keep doubling your virtual warehouse size to scale for increased query complexity, or add more identical virtual warehouses to scale for increased user and query concurrency. This has been referred to as “credit fever”.
This pricing model addresses the conflict of interest described in the previous section. With optimized storage and indexing, Firebolt has made it its main priority to enable higher performance with fewer compute resources, while still maintaining the convenient pay-per-use model.
Users can select engines from a variety of instance types optimized for different kinds of workloads, and scale up or down from 1 to 128 nodes on the fly. By leveraging sparse indexes, Firebolt prunes data down to the block level, reducing the network bandwidth needed and hence the size of the compute instances needed to process the data. Firebolt also provides the ability to auto-stop engines and to scale them up, down, out or in, reducing compute spend.
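To see why block-level pruning translates into smaller (and cheaper) engines, here is a back-of-the-envelope sketch. The block size and selectivity figures are assumptions for illustration, not Firebolt benchmarks.

```python
# Back-of-the-envelope illustration of why pruning matters for cost. The table
# size, block size and selectivity are hypothetical; this is just the arithmetic
# behind "scan fewer blocks, need smaller engines".

table_tb = 10.0          # total table size on object storage
block_mb = 100           # assumed data block size
blocks_total = table_tb * 1024 * 1024 / block_mb

selectivity = 0.002      # assume the sparse index narrows the query to 0.2% of blocks
blocks_read = blocks_total * selectivity

scanned_gb_full = table_tb * 1024
scanned_gb_pruned = blocks_read * block_mb / 1024

print(f"full scan reads   ~{scanned_gb_full:,.0f} GB")   # 10,240 GB
print(f"pruned scan reads ~{scanned_gb_pruned:,.0f} GB") # ~20 GB
```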
Firebolt stores data in a sorted, indexed and compressed format on cost-effective object storage, enabling TCO savings while meeting performance needs.
Firebolt provides the most efficient pay-per-use pricing model, which enables users to reach higher performance with fewer resources. Pricing is completely transparent - users can always see the rate of the resources they use directly in the user interface, and they don’t have to be certified accountants to understand it.
The new approach is especially useful for the workloads that strain the other models: large data sets, complex queries, high user concurrency and frequent, repetitive analytics.
While pricing models tell the story of a specific vendor offering, there could be hidden costs that impact the bottom line.
While a consumption model is a great place to start, be aware of potential hidden costs, which crop up when you least expect them. For example, there are options to run open source software like ClickHouse and Druid in the cloud. These technologies are purpose-built; however, they incur additional costs in various forms. The need for specialized block and file storage options can result in higher costs that continue to grow as your data grows. The inability to isolate environments can result in creating multiple copies of data and hence multiplying costs. Tasks such as cluster rebalancing and storage optimization can require downtime and involve additional specialized technical staff to manage operations. Given the high demand for technical skills, factoring in this undifferentiated heavy lifting is important.
So what’s the right platform and pricing model for you? The key points to consider are how often you run analytics, how much data you scan and how fast it grows, how many concurrent users you have, the performance you need, and the administrative and hidden costs that come with each platform.
On a personal note, why should costs stand in the way of an amazing insight? A good solution encourages you to run more queries, on more data, with more users, to get more value - and all without the cost trade-off.