Part 2 - the right ad hoc analytics architecture
In part 1 - the 4 requirements for an ad hoc analytics architecture, we explained both the 4 requirements for supporting ad hoc analysis and why most data warehouse and analytics architectures fail to meet them:
- Optimize for performance
- Add any new data at any time
- Perform any ad hoc analysis with no planning
- Run any ad hoc analysis at any time
If you do not understand or agree with the requirements, go back and read about them first. If you do, and especially if you’ve suffered because of the wrong architecture or technology in the past, then you may already have the right architecture in mind.
You can also watch the webinar: How to survive ad hoc and interactive analytics with big data, which goes through some of the challenges and what’s below.
The goal is for all data loading and analysis to be done entirely by an analyst, without impacting anyone else. If an analyst is going to be able to get their own data and perform their own analysis, you need the following:
- A data lake for self-service access
- A data warehouse for self-service ELT and ad hoc analysis
These do not need to be the only components of an ad hoc analytics architecture. But both are needed, and they need to work together.
A data lake for self-service access
When I say you need a data lake for self-service access, I still sometimes get a skeptical look. Some analytics veterans who have gotten burned trying to run Hive, Presto or some other engine on Hadoop are skeptical about data lakes. Others think data lakes already support self-service access.
Just let me explain.
Data lakes were created in part to provide self service and get around lengthy ETL and data warehouse deployment times. The idea was to create a place where analysts could get at the raw data, and transform the raw data later, as needed. It was James Dixon, who was at Pentaho at the time working on data integration and ad hoc analytics, who first coined the term data lake.
If you have built a true data lake, you are almost there. By data lake I do not mean a data warehouse like Snowflake holding all your data. I also don’t mean Hadoop, which may host a data lake implementation but couples it with compute. I say this with some respect because I also had to go back to James and ask him just to get it clear in my head.
A data lake is a place where you store your raw data. If you have taken the time to make your data lake your source of the truth, then you need to make sure you can quickly pull any data for any need, including ad hoc analysis.
A data warehouse, even a modern cloud data warehouse, is not a data lake. For one, a data warehouse is storage and compute. Even when it is technically a decoupled storage and compute architecture, you can’t use the storage without the compute, and the storage is in a proprietary format. In addition, data in a data warehouse is not raw. It is highly structured as relational tables, with limited or no support for semi-structured data.
A data warehouse will also force you to limit data access. Why? Because it’s expensive. You will be charged every time for the compute needed to add or extract data. Most companies limit cloud data warehouse usage because compute is expensive. So they do things like export Tableau extracts. But restricting usage defeats the whole purpose of a data lake.
If a data lake is to support true self-service, then the data must be organized, cleansed, and easily accessible by an analyst. If you want to understand what that might look like, go read Alex Gorelik’s The Enterprise Big Data Lake. Alex not only goes way back in reporting and analytics; he is also a governance expert. It will give you a sense of where you might end up as you layer on governance over time.
The key takeaway is to make sure you have an independent data lake that is your single source of the truth, one that you can change over time. Because it will change, and it may get replaced.
A data warehouse for self-service ELT and ad hoc analysis
Once you have an easily accessible data lake with the right data, how should an analyst access and analyze the data? That is the hardest question to answer. One approach people have been taking is a fast “self-service” data warehouse that data engineers can use with an analytics tool to perform ad hoc analysis. Firebolt fits into that category.
A self-service data warehouse for ad hoc analytics requires:
- Decoupled storage and compute
- Self-service ELT
- Self-service optimization (the latest magic)
- Sub-second query performance at scale
- High performance joins
- Fast groupby and filter operations
- Support for semi-structured data (JSON)
- High user/query concurrency
Decoupled storage and compute
In order to allow new data and analytics to be ingested and deployed quickly without testing, you need to isolate them from other analytics workloads. That is exactly what a decoupled storage and compute architecture does, and what has been missing in most analytics architectures.
Data warehouses can isolate ad hoc analysis from other workloads by simply running them on different clusters that fetch data as needed. If they can deliver the performance you need at a reasonable cost, they are a viable solution.
Specialized analytics servers like Druid cannot isolate easily. They have to load a full copy of the data into a new cluster with coupled storage and compute. Because these clusters are not automatically elastic or decoupled, they are often too expensive to run separately for each workload.
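As a rough sketch of what that isolation can look like in practice: with decoupled storage and compute, ingestion and ad hoc analysis each get their own engine over the same shared data. The statement and option names below are illustrative assumptions, not exact DDL for Firebolt or any other product.

```sql
-- Illustrative sketch: statement and option names are assumptions,
-- not exact DDL for any particular product.

-- An engine dedicated to ELT / ingestion.
CREATE ENGINE elt_engine WITH SPEC = 'medium' SCALE = 2;

-- A separate engine for ad hoc analysis over the same shared storage,
-- so exploratory queries never compete with ingestion or dashboards.
CREATE ENGINE adhoc_engine WITH SPEC = 'large' SCALE = 1;
```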
Self-service ELT
ELT and analytics have to be together as well. An analyst needs to be able to:
- Choose any new data
- Load it into the data warehouse
- Transform data
- Optimize performance
- Optimize queries
If an analyst or data engineer can do all this together without impacting other workloads, and hand over the right SQL to something like Looker, Tableau, or even a REST API, you have self-service ELT and analytics.
Take Firebolt as an example. An analyst or data engineer can write standard SQL to:
- Read any format from a data lake (Parquet, ORC, …) as an external table
- Access any relational or semi-structured data (SELECT)
- Transform it (e.g. aggregations such as SUM, COUNT)
- Write it into Firebolt
- Optimize by defining indexes in SQL (more on that later)
- Write SQL queries (e.g. for APIs) or query directly (via JDBC)
The SQL for ELT and the SQL for analytics can be deployed by the analyst or engineer to any existing or new clusters of any size, separately or together. In other words, they can deploy completely isolated ELT and analytics quickly and on their own.
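Here is a rough sketch of what those steps can look like end to end. The table names, columns, and S3 path are hypothetical, and the external table options are loosely modeled on Firebolt-style DDL (credentials and other details omitted), so check the documentation before reusing any of it.

```sql
-- Hypothetical names and paths; external table options are illustrative.

-- Read raw Parquet files straight from the data lake.
CREATE EXTERNAL TABLE ex_raw_events (
    event_time  TIMESTAMP,
    customer_id BIGINT,
    event_type  TEXT,
    revenue     DOUBLE PRECISION
)
URL = 's3://my-data-lake/events/'      -- assumed bucket layout
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET);

-- Transform and load into the warehouse in a single statement.
-- Assumes a fact_events table already exists (see the next section
-- for a sketch of defining one with a primary index).
INSERT INTO fact_events
SELECT event_time, customer_id, event_type, revenue
FROM ex_raw_events
WHERE event_time >= '2021-01-01';

-- The same analyst can then query it directly or hand the SQL to a BI tool.
SELECT event_type, COUNT(*) AS events, SUM(revenue) AS total_revenue
FROM fact_events
GROUP BY event_type;
```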
The use of SQL for ELT, as well as the use of SQL-based tools such as dbt, is proven. It has become part of many modern data pipelines. Those examples show you can have more self-service and faster ELT deployment times. Deployed on top of elastic compute, you can make load times very fast as well. With some engines, like Firebolt, ingestion can even be continuous.
Self-service optimization (the latest magic)
I know it may seem like I’m being overly dramatic, like David Copperfield, to call this magic. I am. Sorry. I think it’s important, just as it was important when data lakes first came out, to be dramatic about the most important new component.
Here’s the magic trick.
- Each time you want to do a new ad hoc analysis, just declare what you need in SQL.
- Each time you need to load new data, load each new fact table (or dimension table) to its lowest level of granularity, and declare the primary index.
Now, how does this magic actually work? With mirrors and strings, otherwise known as materialized views and indexes, along with a query optimization engine that automatically knows how to optimize and when to use them.
Here are the first steps:
- Whenever you need to analyze some data, you define a fact table with as much detail as you might need, along with the foreign keys required for joins. An organized data lake is important here because it helps ensure those relationships can be easily identified using SQL.
- For each fact (and dimension) table, you define a primary index that is a composite of the important columns you will use in queries against the raw data (see the sketch below).
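As a sketch of those two steps, here is what the DDL might look like, loosely following Firebolt-style fact and dimension tables. The schema and names are hypothetical and the exact syntax is an assumption, so treat it as an illustration rather than a reference.

```sql
-- Hypothetical schema; DDL loosely follows Firebolt's FACT/DIMENSION style.

-- A dimension table for the attributes you will join on.
CREATE DIMENSION TABLE dim_customers (
    customer_id BIGINT,
    region      TEXT,
    segment     TEXT
) PRIMARY INDEX customer_id;

-- The fact table at its lowest level of granularity, with foreign keys
-- to the dimensions and a composite primary index on the columns most
-- queries will filter or join on.
CREATE FACT TABLE fact_events (
    event_time  TIMESTAMP,
    customer_id BIGINT,          -- foreign key to dim_customers
    event_type  TEXT,
    revenue     DOUBLE PRECISION
) PRIMARY INDEX event_time, customer_id, event_type;
```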
Primary indexing can actually deliver enough speed for the majority of queries. But if it needs to be even faster, you can do the following:
- If you need faster joins, define join indexes across any two tables using DDL. You can do this all upfront or as you need them.
- If you need faster groupby, aggregation or filter operations, which might require aggregating data, just define a materialized view, or an aggregating index in Firebolt, with all the data and a different primary index, sort order, and partition scheme.
With the right architecture you can declare what you need as you need it and not have to maintain it. Firebolt, for example, automatically updates all indexes, including aggregated views, during each row-level write.
This trick is only possible with a decoupled storage and compute architecture in a data warehouse for several reasons.
- Ingestion and analytics workloads are completely isolated from each other automatically. You don’t have to impact anyone else’s queries.
- All materialized views and indexes are built off the raw data, which is unchanging. You can simply add new tables, indexes or materialized views instead of changing existing ones.
- During ingestion of new data into the fact and dimension tables, everything is automatically updated.
Now let’s work through how to achieve the performance.
Sub-second query performance at scale
Let’s assume you can build this architecture (you can). Once the data is loaded and ready, what needs to be done differently in a cloud data warehouse to support true, interactive query performance, which really means performing any operation in 1 second or less?
Cloud data warehouses like Snowflake, Redshift or Athena have not implemented the right optimizations needed to deliver consistent second-to-sub-second performance the way a data warehouse like Firebolt has. If you want to understand the differences in more detail, you can look at the data warehouse comparison guide.
But here are some key areas to consider and compare:
- Storage access latency: Measure the performance of cloud data warehouses using first-time queries, since that is what ad hoc analysis looks like in practice. When the data isn’t already cached, most cloud data warehouses pull over the network any partitions that might contain data needed for a query. That can take seconds to minutes depending on the type of query, the size of the data, and the data distribution. Firebolt only pulls the specific data it needs, not entire partitions. That makes a huge difference in total query times.
- Query performance: Build your list of query optimization techniques including vectorized processing, cost-based optimization, just-in-time compilation and caching. Then line up the different types of queries you need, and benchmark. Snowflake does vectorized processing and some cost-based optimization. But the exact level of optimization will come out in the benchmarks.
- Indexing: Indexes matter. They have always mattered. So do pre-computing results like aggregations. One of the key reasons Firebolt delivers 10-100x faster performance is that it uses a primary composite index. Benchmark with and without various types of indexes to understand relative performance.
- Concurrency: In today’s world, a lot of employees and even customers will want to run ad hoc analysis. Troubleshooting in operations requires it. Both Athena and Redshift have limited concurrency (see the data warehouse comparison guide for more).
Ad hoc analysis requires join, groupby, and filter operations to be really fast without having to change the fact and dimension tables.
But why a data warehouse and not a data mart technology like a specialized analytics database? The short answer is that they don’t let you perform ad hoc analysis on enough data. They have either restricted capabilities too much for the sake of performance, such as not supporting joins, or are too domain specific, like time series databases, and not focused on general-purpose ad hoc needs.
If you really want to make sure you have good ad hoc performance, remember to benchmark first-time query performance. Clear your caches in your BI tool and your backend. Most cloud data warehouses use tiered caching to support repetitive query workloads well. This is why Fivetran and others focus on first-time queries and clear all caches each time; they’re testing ad hoc performance.
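For example, in Snowflake you can disable the result cache for a session so every run behaves like a first-time query; the session parameter below is Snowflake’s, while the query itself is just a hypothetical placeholder.

```sql
-- Snowflake: turn off the result cache so each run is a "first-time" query.
-- (The warehouse's local data cache is separate; suspending and resuming
-- the warehouse clears that as well.)
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Hypothetical ad hoc query: run it cold, not after a warm-up pass.
SELECT customer_id, COUNT(*) AS events
FROM fact_events
WHERE event_time >= '2021-01-01'
GROUP BY customer_id
ORDER BY events DESC
LIMIT 100;
```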
High-performance joins
Joins are one of the biggest issues. Some experts will explain that a denormalized table is always faster than a join, especially when you use columnar storage and indexing for groupby and filter operations. One problem is that you need to denormalize in advance, and with ad hoc analysis you never know what you will need. Another is that sooner or later you will need to join two large fact tables. It is inevitable. Sometimes they are so big that you can’t just pre-join them and keep all the data.
If you have join support, it definitely helps reduce storage and reloading. With joins, you can limit changing attributes like promotions or shelf location in retail to dimension tables, for example, which can allow you to do incremental loads on your fact table. If you denormalize and are dependent on columnar storage with immutable partitions, you will end up having to re-load a denormalized fact table on a weekly or even daily basis.
Even more importantly, joins let you do much more ad hoc, much faster. If you require a single table for analysis and don’t yet have the data merged in, you will be waiting weeks while your ETL team makes the changes in Spark or some other tool. With a join, it’s immediate. If you have a fast way of loading tables from a data lake and do fast joins, that might take hours. But it’s not weeks.
It turns out you don’t need to give up joins to get performance. You can do high-performance joins using the right optimizations, including indexes built specifically for joins, so a join is not inevitably several times slower than a denormalized scan. Firebolt’s join index basically replaces full scans with an in-RAM hash table index and lookups. Do the benchmarks and compare to make sure you understand where to use join indexes vs denormalization. You can do both.
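As a sketch of the technique: a join index keeps the join key and the needed dimension columns in an in-memory lookup structure so the join avoids a full scan. The table names below are hypothetical and the DDL is an approximation of Firebolt-style join index syntax, not a reference.

```sql
-- Hypothetical names; approximate DDL, not exact syntax.

-- A dimension table holding attributes that change often.
CREATE DIMENSION TABLE dim_products (
    product_id BIGINT,
    category   TEXT,
    promotion  TEXT
) PRIMARY INDEX product_id;

-- A join index over the join key and the dimension columns queries need,
-- so the join becomes an in-memory lookup instead of a scan.
CREATE JOIN INDEX ji_products ON dim_products (product_id, category, promotion);

-- The optimizer can now serve this join from the index.
SELECT p.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_products p ON f.product_id = p.product_id
GROUP BY p.category;
```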
Decoupled storage and compute helps as well. Most analytics databases cannot isolate workloads, so they throttle users and limit changes. With decoupled storage and compute you can bring up new engines with overlapping data for different workloads and isolate ad hoc from other mission-critical analytics. You can do any combination of joins and denormalization as you see fit for the specific analytics.
Fast groupby and filter operations
Groupby and filter operations are the core of slice-and-dice ad hoc analytics. The way you get performance at scale is indexing. Beyond a primary index and sort order to minimize scans, there are other indexes and pre-computations that matter.
Aggregations are critical. When you pre-compute aggregations you can dramatically improve performance. Firebolt has an aggregating index which supports any standard aggregation operation, like distinct count. But it can also store the entire fact table with a different composite primary index and sort order, and be used together with join indexes. Other databases use materialized views. OLAP servers build multi-dimensional sparse cubes. Make sure you can aggregate.
Filters are the other major operation. When you implement a composite primary index, the columns included in the index will perform fast filter operations. For other columns, you will want to use other types of indexes to “complement” the primary index. In Firebolt, you can use aggregating indexes, but there are other options.
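A sketch of what that can look like, using Firebolt-style aggregating index DDL with hypothetical names; materialized views play the same role in other warehouses, and the exact syntax here is an assumption.

```sql
-- Hypothetical names; approximate DDL.

-- Pre-compute the aggregations that slice-and-dice queries keep asking for.
CREATE AGGREGATING INDEX agg_sales_daily ON fact_sales (
    sale_date,
    product_id,
    SUM(revenue),
    COUNT(DISTINCT customer_id)
);

-- Queries that group by the same columns can be answered from the index,
-- with no change to the SQL and no full scan of fact_sales.
SELECT sale_date, product_id, SUM(revenue), COUNT(DISTINCT customer_id)
FROM fact_sales
GROUP BY sale_date, product_id;
```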
Support for semi-structured data (JSON)
Not all data is relational. You need to be able to perform fast operations against JSON and other semi-structured data. This is a really big limitation with most cloud data warehouses. Redshift requires you to flatten, or unnest, your JSON into separate columns. Others, like Athena and Snowflake, store JSON, but as text, which is part of what leads to their slow performance and higher cost with JSON. If you want reasonable performance, you have to flatten the JSON.
Firebolt added native support for semi-structured data, with optimized lambda-style functions used within SQL, specifically to deliver second-level performance for any combination of JSON and relational data. Because a lot of data about customers and operations comes in the form of JSON, you should plan on using it in ad hoc. One company had been doing extensive JSON flattening and customization for customer-facing analytics just to get each of their customers the right dashboards and ad hoc operations for their own data, which was always different. They replaced the customizations with two JSON-based columns with built-in logic that automated the personalization they needed within the same dashboards. You can expect to see similar benefits.
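As a very rough sketch of querying semi-structured data without flattening it first: the lambda-style array function below is modeled loosely on Firebolt’s array functions, but the function name, signature, and table are assumptions, so check the documentation for the real syntax.

```sql
-- Illustrative only: function name/signature and table are assumptions.

-- Each row carries an array of event tags ingested from JSON.
-- Count purchases per customer without unnesting the array into rows.
SELECT
    customer_id,
    SUM(ARRAY_COUNT(tag -> tag = 'purchase', event_tags)) AS purchases
FROM customer_events
GROUP BY customer_id;
```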
High concurrency
This final point is important. Many more people need ad hoc analysis today. Most analytics databases and data warehouses do not support high concurrency. They were built to support smaller teams of analysts, not larger groups of employees or customers. Amazon Athena has a default limit of 20 concurrent queries and has to throttle each account to protect everyone else using the same shared resources within Athena, just like BigQuery. Redshift has a limit of about 50 queued queries across all clusters, even when using multiple clusters for load balancing.
Make sure you benchmark ad hoc performance, concurrency and scale together.
Plan ahead for ad hoc analysis
You don’t need to implement ad hoc right away. But you should make sure you don’t paint yourself into a corner where you can’t easily implement it. Luckily there are only a few things you need to do right up front.
First, make sure you build a data lake that is just a lake, not a data warehouse. It should be simple storage, like S3 on Amazon, with data stored in an open format such as Parquet. It does need to hold raw data, including semi-structured data.
Second, identify where you already have ad hoc in the form of extracts, spreadsheets and even custom applications across analysts, employees and customers. I added customers because we have several customers who provide analytics to their customers, and it’s ad hoc. Keep a map of these projects and what data they need. The hardest part is not choosing the right data warehouse. It’s the data management and governance required to integrate, cleanse and make the data consistent for different users.
Be ad hoc with your data, not with your roadmap. Plan ahead.
If you have questions, be sure to attend the webinar and ask your questions.