In a recent workshop, 25 data pros working in the Ad Tech industry met Matthew Darwin, a Solution Architect at Firebolt, to discuss challenges and solutions for querying large data sets efficiently.
What we covered:
- Ad Tech data challenges
- Demo: Scanning 31 TBs in 0.4 seconds
- Working with indexes
- Analyzing clickstream data without JOINs
More data equals more problems
More data equals more problems, especially when it comes to controlling costs and storing data. The group discussed balancing speed against cost: some members aim to keep query times under one minute, while others said their users typically expect results within five seconds. This puts pressure on analytics teams to compromise in order to achieve sub-second results. Pre-aggregating data can make reports fast, but it sacrifices granularity and limits the ability to drill into reports.
Demo: Scanning 31 TBs in 0.4 seconds
In this demo, Matthew showcases how Firebolt's cloud data warehouse efficiently processes and analyzes large amounts of Ad Tech data. The demo begins with an overview of a Superset dashboard from one of Firebolt's customers that displays data for impressions, click-through rates, clicks, and media sources. The dashboard, which is powered by a 54-billion-row table holding 31 terabytes of data, was originally slow and often timed out, frustrating customers.
Each filter change in the dashboard triggers a separate query that runs against the data in real time. Despite the volume of data being queried, the dashboard responds quickly and offers a snappy experience for the user.
So how does Firebolt achieve this efficiency? Take a look at Firebolt's UI, which displays the LTV table. The table has a primary index sorted by timestamp date, media source, and app ID. On top of it, an aggregating index further optimizes the data pipeline: it pre-computes aggregations across different metrics and keeps its own internal sort order based on the columns being filtered.
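In SQL terms, a minimal sketch of that setup might look like the following. The column names and types are assumptions based on the demo description, and DDL details vary by Firebolt version:

```sql
-- Hypothetical schema; columns inferred from the dashboard's metrics.
CREATE FACT TABLE ltv (
    event_date   DATE,
    media_source TEXT,
    app_id       TEXT,
    impressions  BIGINT,
    clicks       BIGINT
) PRIMARY INDEX event_date, media_source, app_id;

-- Pre-computed rollups with their own sort order on the filtered columns.
CREATE AGGREGATING INDEX ltv_agg ON ltv (
    event_date,
    media_source,
    app_id,
    SUM(impressions),
    SUM(clicks)
);
```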
When running a query against the LTV table, which displays the most granular portion of the data on the dashboard, the query runtime is about 0.4 seconds, and the amount of data scanned is only 3.2 GB, despite the table's full 31 TB size. This data-pruning effect is a direct result of Firebolt's two indexing options: primary and aggregating indexes. Primary indexes optimize the data layout, while aggregating indexes simplify data pipelines; both are applied automatically by Firebolt's optimization engine. Because aggregating indexes require no query rewrites, queries can be run against the original table unchanged.
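For instance, a dashboard query of this shape (filter values here are hypothetical) would need no rewrite for the aggregating index to kick in:

```sql
-- An ordinary GROUP BY against the base table; the optimizer can
-- answer it from ltv_agg, scanning gigabytes instead of terabytes.
SELECT
    event_date,
    media_source,
    SUM(impressions) AS impressions,
    SUM(clicks)      AS clicks
FROM ltv
WHERE event_date BETWEEN '2023-01-01' AND '2023-01-31'
  AND app_id = 'com.example.app'   -- hypothetical filter value
GROUP BY event_date, media_source;
```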
How Firebolt works:
Firebolt has been designed to help developers and data engineers build data analytics apps quickly and reliably. Interacting with Firebolt is simplified through an 'Everything in SQL' approach that covers provisioning, automation, query execution, and more. SDKs and APIs are also available for interacting with the platform, and Firebolt's connectors make it easy to plug the system into existing data stacks such as Superset, Tableau, Looker, and Airflow.
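As an illustration of the 'Everything in SQL' idea, even engine provisioning is a SQL statement. The property names and spec values below are placeholders and vary by Firebolt version:

```sql
-- Engine lifecycle handled entirely in SQL.
CREATE ENGINE IF NOT EXISTS adtech_engine
WITH SPEC = 'S4'    -- placeholder engine spec
     SCALE = 2      -- number of nodes
     AUTO_STOP = 20; -- idle shutdown, in minutes

START ENGINE adtech_engine;
-- ... run queries ...
STOP ENGINE adtech_engine;
```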
Architecture:
Data is stored externally and ingested into Firebolt's S3 backend for long-term storage. Storage and compute are decoupled, letting you tune each to your workload. Firebolt has four main engine types: compute-optimized, SSD-optimized, RAM-optimized, and balanced. These let you choose compute sizing that matches your workload, enabling efficient queries on minimally sized hardware.
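A minimal sketch of that ingestion path, assuming a hypothetical Parquet source in S3:

```sql
-- External table over raw files in a placeholder bucket,
-- loaded into the managed fact table with a plain INSERT ... SELECT.
CREATE EXTERNAL TABLE ex_ltv (
    event_date   DATE,
    media_source TEXT,
    app_id       TEXT,
    impressions  BIGINT,
    clicks       BIGINT
)
URL = 's3://my-bucket/ltv/'    -- placeholder location
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET);

INSERT INTO ltv SELECT * FROM ex_ltv;
```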
Indexes:
Firebolt has three types of indexes: primary, aggregating, and join indexes. Primary indexes sort and compress the data, delivering good compression ratios and faster scans. Aggregating indexes serve repeated workloads such as dashboards and reports. Join indexes convert expensive joins into simple, efficient in-memory lookup operations, improving query performance.
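For example, a join index over a hypothetical media-source dimension table might be declared like this:

```sql
-- The join key plus the looked-up columns are cached in RAM,
-- so joins against this table become lookups rather than scans.
CREATE JOIN INDEX media_source_join_idx ON media_sources (
    media_source_id,    -- join key
    media_source_name   -- column fetched by the join
);
```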
Pricing:
Firebolt's pricing is simple and transparent, with no upfront commitments. It uses pay-as-you-go pricing: you only pay for an engine while it's running, with no charges when it's stopped. Storage is charged at the as-is S3 list price of $23 per terabyte per month (varying by region). Firebolt's efficiency lets you choose the compute size that matches your workload, making it more cost-effective than its competitors.
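To put that storage number in context: at that rate, even the 31 TB table from the demo would cost roughly 31 × $23 ≈ $713 per month to store, before any compression savings from the primary index.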
Demo: Analyzing Clickstream Data - Without JOINs
Clickstream data is a common use case that requires fact-to-fact joins to identify conversion rates. The data is granular and row-level, making it difficult to compare events without performing a join on the fly. Using the example of a clickstream table with event type, session ID, event time, and viewer, Matthew demonstrates a subquery that selects clicks (event type 2) from sessions that also recorded a conversion (event type 3). This type of query can be slow due to the granular nature of the data, but it can be improved with creative data modeling and aggregating indexes. By nesting each session's event types into an array and filtering on the nested values with a subquery, it's possible to check whether a session contains event type 3 without joining the table to itself. This approach reduces the query runtime from 28 seconds to about five seconds. Overall, this demo highlights the advantages of using Firebolt for clickstream data analysis and shows how creative data modeling can lead to significant performance improvements.
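A sketch of both query shapes, with hypothetical table and column names (NEST and ARRAY_CONTAINS stand in here for Firebolt's array aggregation and lookup functions):

```sql
-- Before: fact-to-fact comparison via a subquery (~28 s in the demo).
SELECT COUNT(DISTINCT session_id)
FROM clickstream
WHERE event_type = 2               -- clicks
  AND session_id IN (
      SELECT session_id
      FROM clickstream
      WHERE event_type = 3         -- conversions
  );

-- After: nest each session's event types into an array, then filter
-- with array lookups instead of a join (~5 s in the demo).
SELECT COUNT(*)
FROM (
    SELECT session_id, NEST(event_type) AS event_types
    FROM clickstream
    GROUP BY session_id
) sessions
WHERE ARRAY_CONTAINS(event_types, 2)
  AND ARRAY_CONTAINS(event_types, 3);
```

The inner GROUP BY is also exactly the shape of query an aggregating index can serve, which is where the additional speed-up comes from.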