On November 13, 2020, Amazon announced general availability of a new query engine version, Athena engine version 2. Upgrading is not a big deal, because Athena is a serverless SaaS query engine. But read the documentation and the list of breaking changes first to make sure you won’t have issues.
What is Athena 2.0 vs Presto? Athena 2.0 is built on Presto 0.217, which was officially released Feb 21, 2019. Yes, that’s a few years ago now, and yes, Presto is not labeled a “1.0” product yet. Is that a problem? Not really. Athena 1.0 was built on 0.172 and that was stable. In fact, if you move the decimal point on Presto for 0.172 and 0.217, you go from 1.x to 2.x. Problem solved.
Athena does add other features on top of core Presto. Athena is serverless; all you have to do is configure Athena to use your storage and then send it queries. Athena does everything else for you. With Presto, you need to completely roll your own deployment, from the underlying infrastructure, to installation, to development, deployment and management of all your compute.
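To show how little there is to set up, here is a minimal Python sketch of "configure your storage, then send a query". The helper name `build_athena_request`, the database, the table and the bucket are all made up for illustration. The dictionary keys follow the parameters of boto3’s Athena `start_query_execution` call, and the commented-out lines show how the request could be sent if boto3 and AWS credentials are available.

```python
def build_athena_request(sql, database, output_s3_uri, workgroup="primary"):
    """Build the keyword arguments for an Athena query submission.

    The only storage-related setting is where Athena should write results.
    """
    if not output_s3_uri.startswith("s3://"):
        raise ValueError("Athena writes query results to an S3 location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
        "WorkGroup": workgroup,
    }


# Hypothetical database, table and bucket names.
request = build_athena_request(
    sql="SELECT status, count(*) FROM orders GROUP BY status",
    database="sales",
    output_s3_uri="s3://example-athena-results/queries/",
)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   athena = boto3.client("athena")
#   query_id = athena.start_query_execution(**request)["QueryExecutionId"]
```

There is no cluster to size, install or patch in this sketch, which is the whole point of the comparison with a self-managed Presto deployment.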
The real question is this: what’s new and big enough to call Athena 2.0 a “major” 2.0 release?
In some ways, I would argue it’s not. It strikes me more as a 1.5 release. I say this because this is not the major rewrite release we’ve come to expect with 2.0 releases. It seems more like a stable release with new features. Presto hasn’t had major changes either.
But it is an important release you should be using if you’re an existing Athena user. There are several Athena 2.0 performance improvements, as well as feature gaps that have been filled outside of core Presto that will really improve your deployment.
If you read through the announcements, here is what Amazon highlights as the major new features in Athena 2.0:
- Performance improvements - First, and perhaps foremost, there are a host of performance improvements Presto brings to joins, aggregations, and subqueries. There are also new features with lambda functions that are important, even if they’re not major new features. Performance is more important than ever before, especially when it comes to newer operational and customer-facing analytics.
- Federated queries - Yes, you’re reading that right. Athena was already a federated query engine. But this latest release expands the sources you can query across, including non-AWS sources, using Lambda functions as connectors.
- Geospatial functions - There are a lot of new functions. The investment in more geospatial is not surprising given that operational analytics do often involve monitoring the location and other attributes of people, mobile devices, or the Internet of Things (IoT) in general.
- Nested schema - Expanded support for nested structures in Athena is a great feature. We have more semi-structured data like JSON than traditional relational data. Nested schema helps represent semi-structured data more naturally. It is far better to use nested structures and dot or bracket notation for certain types of queries than to be forced to flatten data and use nested queries for everything.
- Schema evolution support - These improvements are a great addition. Schemas can and will change unexpectedly, especially when the data sources and federated query engine(s) are managed separately. Your architecture needs to be flexible to such changes, which makes support for schema evolution a requirement.
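The nested schema point is easier to see with a small example. The Python sketch below is only an analogy, not Athena code: the `order` record, `get_path` and `flatten` are invented for illustration. It contrasts reaching into a nested record with a dot path against flattening the record into prefixed columns first.

```python
def get_path(record, path):
    """Resolve a dot path such as 'customer.address.city' in nested dicts.

    Returns None when any level is missing.
    """
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def flatten(record, prefix=""):
    """Flatten nested dicts into one level with underscore-joined names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "_"))
        else:
            flat[name] = value
    return flat


# Hypothetical semi-structured record, as it might arrive in JSON.
order = {
    "id": 42,
    "customer": {"name": "Ada", "address": {"city": "Seattle", "zip": "98101"}},
}

city = get_path(order, "customer.address.city")   # nested access
columns = flatten(order)                          # flattened alternative
missing = get_path(order, "customer.phone")       # field absent in this record
```

With the nested form, the shape of the data survives and a new field under `customer` changes nothing for existing queries. With the flattened form, every new nested field becomes another top-level column to manage.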
These features are all great improvements. But if you look a little deeper, you will find a few other very specific features you should be pretty excited about, and others you should hope for.
The first one, which only gets brief attention, is spilling to disk. Spilling to disk? Really?
Yes. Really. Just hear me out. If you look at why Athena queries fail, it is often because an individual node runs out of RAM. Spilling data from RAM to disk prevents those failures by allowing you to process much more data than you can hold in available RAM. It’s a trick in-memory data grids and other RAM-centric technologies have used for years. You will probably take a performance hit once you start to spill. It might even lead to 30-minute timeouts in some cases. But this one feature, combined with the various performance improvements above, could lead to much more scale and far fewer query failures with Athena 2.0.
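For readers who have not seen spilling before, here is a minimal Python sketch of the general technique applied to a group-by count. It is not how Athena or Presto implement it; `spilling_count` and the `max_keys_in_memory` budget are assumptions made for illustration, and keys are assumed to be strings. When the in-memory table grows past the budget, the partial counts are written to a temporary file in sorted order, and the final answer is produced by stream-merging those files.

```python
import heapq
import itertools
import json
import tempfile
from collections import Counter


def _read_run(run_file):
    """Yield (key, count) pairs from one spilled file, in key order."""
    run_file.seek(0)
    for line in run_file:
        key, count = json.loads(line)
        yield key, count


def spilling_count(keys, max_keys_in_memory=1000):
    """Count occurrences per key, spilling sorted partial counts to disk
    whenever the in-memory table exceeds max_keys_in_memory distinct keys."""
    counts = Counter()
    runs = []  # temporary files, each holding sorted partial aggregates

    def spill():
        run_file = tempfile.TemporaryFile(mode="w+")
        for key in sorted(counts):
            run_file.write(json.dumps([key, counts[key]]) + "\n")
        runs.append(run_file)
        counts.clear()

    try:
        for key in keys:
            counts[key] += 1
            if len(counts) > max_keys_in_memory:
                spill()

        if not runs:
            # Everything fit in memory, so no disk access was needed.
            yield from sorted(counts.items())
            return

        if counts:
            spill()
        # Merge the sorted runs as streams and sum counts for equal keys.
        merged = heapq.merge(*(_read_run(f) for f in runs), key=lambda kv: kv[0])
        for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
            yield key, sum(count for _, count in group)
    finally:
        for run_file in runs:
            run_file.close()


# A budget of 2 distinct keys forces a spill on this tiny input.
print(dict(spilling_count(["a", "b", "a", "c", "b", "a"], max_keys_in_memory=2)))
# {'a': 3, 'b': 2, 'c': 1}
```

The trade-off is the one described above: the query finishes instead of failing, but every spill adds disk writes and a merge pass, which is where the performance hit comes from.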
Then there is the feature that might be there but isn’t mentioned: larger nodes. Because the amount of RAM is the main limitation, and because instance types are constantly improving, one would think nodes are being upgraded to something faster and bigger. If someone has more information on this, please get Amazon to speak up. Even if they just give the relative growth, it will help companies a lot. I also think several companies will pay more for bigger nodes. Feel free to pass that on to Amazon as well.
Some of these features would be great with Redshift Spectrum as well. I really like Redshift Spectrum as an option for Redshift customers, much more so than Athena, because you get dedicated Spectrum nodes - up to 10x the number of your Redshift nodes. It’s like having an integrated virtual private Athena. I am periodically reminded by others that Redshift Spectrum and Athena are different. But I think everyone agrees if you use Redshift, you need to use Redshift Spectrum. I hope many of these features make it to Redshift Spectrum.
Finally, there is the feature most people want: better scale-out through query optimization and other efficiencies. Presto has been doing a great job of improving performance through query optimization and caching. The fact that its performance is close to Snowflake, Redshift and Athena in benchmarks by Fivetran, despite the fact that Athena 2.0 does not manage its own storage, is a testament to that optimization.
The optimizations Presto has added in the years since 0.217 was released would make a great Athena 2.1 release. Beyond those, the biggest improvement would be adding ingestion and specialized storage. We all know companies are willing to restrict ingestion and storage to improve performance. That’s why they buy specialized data warehouses like Redshift in the first place.
This combination of technologies seems inevitable. While there will be a subset of groups who want just federated queries or “occasional analytics”, all companies need a data warehouse and also want the ability to access multiple data sources. Sometimes changing your data pipeline and warehouse schema takes too long, and sometimes the performance doesn’t matter.
Yes, this would make Presto a data warehouse, not just a federated query engine. Yes, it would compete with Redshift by creating a single product with at least some of the functionality of Redshift and Redshift Spectrum. This could lead to a possible collision course between Redshift and Presto. Amazon is clearly headed in that direction with Redshift and Redshift Spectrum. AWS has also clearly been positioning Athena for non-data warehouse use cases. It may well remain in that position.
But who knows? In time, this level of optimization might come to Athena. AWS is known for making sure that whatever customers want runs on AWS. So if Athena does deliver a serverless data warehouse + federated query engine, I would happily call it a 3.1 or 4.1 release, because it would be a game changer.