Big Data Analytics for Life Sciences

In a recent webinar, IQVIA took a deep dive into how they harness best-in-class technologies to overcome analytics adoption challenges and maximize the impact of BI solutions for faster and more informed decision-making in healthcare. Here are some key takeaways:

Speed-of-insight is essential for driving adoption at the point of decision-making.
The data variety and complexity typical in Life Sciences create challenges for centralizing data and deriving trustworthy insights.
AI-driven insights can only deliver proven value if the data pipelines and models work effectively.

In the following segment, Jeremy Stroud, Director IT Architecture at IQVIA, and Matan Sarig, Lead Solution Architect at Firebolt, discuss how IQVIA uses Firebolt on AWS to address these challenges.

Transcript highlights

Jeremy: The way we've designed our tech stack is for live queries, so we need a data warehouse under the hood that can scale and support that need. The days of caching data in-memory in cubes have gone. Now we’re looking at a new age of data warehouse technologies that can provide sub-second responses on large data sets with high concurrency. If we don't achieve that, we don't have that performance, and unfortunately we have a cascading negative effect on the adoption of our BI applications.

You're probably familiar with the likes of Snowflake and Redshift. What drew us to supporting Firebolt within our analytical-ready APIs is their philosophy on their technology, which is pure speed. They're built on the idea of delivering insights as quickly as possible.

There were key areas we looked at with Firebolt to support our stack. First of all, we needed a technology that could support sub-second queries. There's a big divide in the industry today around technologies focused on big data use cases. They've targeted the ability to scan petabytes of data. There's a set of legacy database technologies that are starting to evolve and support those larger use cases, but historically have been better suited to smaller data sets, transactional use cases.

If you think about the diverse landscape of data applications, we have customer-facing analytics, we have batch analytics, Machine Learning algorithms, and business processes. All of those vary in the speed of data they need access to, volume of data, and speed in which that data must be returned. For us, query performance needed to be consistent across those.

Whether we had 100, 200, 250 users accessing a BI tool, we needed that consistent sub-second query performance. On the flip side, if we needed to scale up to support a huge data query, we needed to support that. Query performance and concurrency were a huge driver. Firebolt is also scaling for the future to support huge data volumes, so we have full trust that in 5-10 years, when data volumes get to that level, it's the right technology to support them.

At this point, I'd like to invite Matan to talk through some of the use cases, and why Firebolt has become such an important part of our analytical-ready APIs and our tech stack.

Matan: Sure. My name is Matan. I'm the Lead Solution Architect at Firebolt, and I've been working closely with IQVIA for quite some time now. I’d like to talk about how Firebolt on AWS addresses the challenges that IQVIA has raised.

Efficient data ingestion

The first challenge I'd like to address is data ingestion. There are a few key concepts that allow us to be efficient when it comes to data ingestion. The number one concept is Firebolt's decoupled storage and compute architecture, which allows you to flexibly align the compute resources to the task you want to fulfill. For example, you could, for a large backfill of historical data, utilize a very large engine, and for a smaller incremental insert, use a smaller engine.

The second key here is how we interact with external data. In Firebolt, any engine can access external data - external meaning that it sits outside the actual data warehouse, but you can actually expose this data using a relational format.

This means you can use SQL to interact with external data, explore or transform it however you like before ingesting it into a Firebolt table. This actually allows Data Engineers and Analysts to explore external data, and build pipelines without depending on others.

The third key is how Firebolt scales data ingestion execution. Firebolt's data ingestion processes are very opportunistic and aspire to run the process as concurrently as possible in order to accelerate it. Data is also sorted and compressed as it is being written into the Firebolt data warehouse.

On top of those three keys, there are a couple more important features that I'd like to bring up as we're talking about data ingestion. One is our advanced support for semi-structured data. In the past you’d have to explode or unpack arrays into individual rows within your table. In Firebolt, you can store the data within arrays and access it very efficiently when you read data from the table. Firebolt also supports commands like deletes and updates, which perform very efficiently as well. So this is how we optimize for data ingestion.

Sub-second read performance

Next, I’d like to talk about how we optimize for read performance, which is really what Firebolt is all about. Firebolt was built to accelerate customer-facing workloads, to deliver sub-second performance.

The first key to how Firebolt is able to do that is all about pruning data using Sparse Indexes. When you create a table, you can decide on a Primary Index for this table. This Primary Index actually determines how the data is stored and sorted in storage.

When interacting with this data, every query goes through this index. This allows us to reduce the scan of the table in a very significant manner. Every time you interact with very large data sets, really you'll be interacting or scanning a very small portion of the data thanks to our Sparse Indexes and our pruning capabilities.
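The pruning idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Firebolt's implementation: the table is modeled as a sorted list of keys, the block size is arbitrary, and the function names build_sparse_index and blocks_to_scan are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_sparse_index(sorted_keys, block_size):
    """Keep only the first key of every block of the sorted data."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), block_size)]


def blocks_to_scan(sparse_index, lo, hi):
    """Return the block numbers that may hold keys in the range [lo, hi]."""
    # Start one block before the first entry >= lo: that block may still
    # contain lo (this is conservative and stays correct with duplicate keys).
    first = max(bisect_left(sparse_index, lo) - 1, 0)
    # Blocks whose first key is greater than hi cannot contain a match.
    last = bisect_right(sparse_index, hi) - 1
    return list(range(first, last + 1))


# A "table" of one million rows, sorted by its primary index column.
keys = list(range(1_000_000))
index = build_sparse_index(keys, block_size=10_000)  # 100 index entries

# A range filter only needs to touch the blocks the index cannot rule out.
hits = blocks_to_scan(index, 123_456, 130_000)
print(f"scan {len(hits)} of {len(index)} blocks: {hits}")
```

In this sketch the range filter touches 2 of 100 blocks, which is the effect described above: the index is small because it holds one entry per block rather than one per row, yet it rules out most of the table before any data is read.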

The power of indexes

Next, I'd like to talk about Aggregating Indexes and Join Indexes, which are very useful in Firebolt. An Aggregating Index is an object that sits on top of a table and is designed to serve aggregating queries, which crunch a lot of data.

In Firebolt, you can define this index using a set of group keys and aggregate functions, and then you set it only once. It’s self-maintained and any future query that would address the relevant attributes and aggregate function will be satisfied by the index. If not, it will fall back to the actual table.
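As a rough mental model of that behavior, the following Python sketch keeps a pre-computed aggregate that is updated on every insert and answers matching queries without touching the base rows. Everything here is an assumption for illustration: the AggregatingIndex class, the sum_by helper, the exact-match rule for group keys, and the sample rows are hypothetical and do not reflect how Firebolt stores or matches its indexes.

```python
from collections import defaultdict


class AggregatingIndex:
    """Pre-computed SUM(value_col) grouped by key_cols (illustrative only)."""

    def __init__(self, rows, key_cols, value_col):
        self.key_cols = tuple(key_cols)
        self.value_col = value_col
        self.sums = defaultdict(int)
        for row in rows:
            self.add(row)

    def add(self, row):
        # Called for every inserted row, so the index maintains itself.
        key = tuple(row[c] for c in self.key_cols)
        self.sums[key] += row[self.value_col]


def sum_by(rows, index, group_cols, value_col):
    """Answer from the index when it matches, else fall back to the table."""
    if tuple(group_cols) == index.key_cols and value_col == index.value_col:
        return dict(index.sums), "index"
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[c] for c in group_cols)] += row[value_col]
    return dict(totals), "table scan"


# Made-up sample data.
rows = [
    {"country": "US", "brand": "A", "units": 10},
    {"country": "US", "brand": "B", "units": 5},
    {"country": "DE", "brand": "A", "units": 7},
    {"country": "US", "brand": "A", "units": 3},
]
index = AggregatingIndex(rows, ["country", "brand"], "units")

by_country_brand, source_1 = sum_by(rows, index, ["country", "brand"], "units")
by_country, source_2 = sum_by(rows, index, ["country"], "units")
print(source_1, by_country_brand)
print(source_2, by_country)
```

The first query matches the index definition and is served from the pre-aggregated values. The second groups by a different key set, so this simplified version falls back to scanning the rows.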

A similar concept is the Join Index in Firebolt. It’s automatic caching of frequently joined datasets. So if you have a lot of join queries within your use case, Firebolt will be able to automatically detect the relevant datasets and cache them in RAM for faster access in future queries.
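The caching idea can be pictured with a small Python sketch: the first join builds an in-memory lookup for the joined dataset, and later joins reuse it. This is only an analogy under assumptions. The JoinCache class, the join helper, and the sample data are hypothetical, and the sketch says nothing about how Firebolt detects or invalidates cached datasets.

```python
class JoinCache:
    """Keeps in-memory lookups for datasets that have been joined before."""

    def __init__(self):
        self._lookups = {}
        self.builds = 0  # how many times a lookup had to be built

    def lookup_for(self, name, dim_rows, key):
        cache_key = (name, key)
        if cache_key not in self._lookups:
            # First join on this dataset: build the hash lookup once.
            self._lookups[cache_key] = {row[key]: row for row in dim_rows}
            self.builds += 1
        return self._lookups[cache_key]


def join(fact_rows, dim_name, dim_rows, key, cache):
    """Inner join of fact_rows to dim_rows on key, using the cached lookup."""
    lookup = cache.lookup_for(dim_name, dim_rows, key)
    return [
        {**fact, **lookup[fact[key]]}
        for fact in fact_rows
        if fact[key] in lookup
    ]


# Made-up sample data.
facts = [
    {"product_id": 1, "units": 10},
    {"product_id": 2, "units": 4},
    {"product_id": 1, "units": 6},
]
products = [
    {"product_id": 1, "brand": "A"},
    {"product_id": 2, "brand": "B"},
]

cache = JoinCache()
first = join(facts, "products", products, "product_id", cache)
second = join(facts, "products", products, "product_id", cache)
print(cache.builds)  # the lookup was built once and reused
```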

High concurrency

One more thing that Firebolt does really well is high concurrency, with the ability to run between dozens and hundreds of queries concurrently. It really depends on the efficiency of each query individually. It's important to implement the three indexes I mentioned to get each query to scan as little data as possible for the engine to support higher concurrencies.

On that topic, I’d like to mention semi-structured support as well, but from a different angle. Once you’re able to store semi-structured objects in Firebolt, there's actually no need to unpack them on the read side either, thanks to some really powerful Lambda functions that are built into our solution.

Imagine you have an array with some elements, and it's stored within your table. You will be able to use the Lambda function as part of your SQL syntax to explore or extract arrays, different elements, and it will essentially be an iteration that will run on each record individually as part of your SQL. This is some of the more powerful programmatic power that is exposed as part of Firebolt SQL.
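A Python analogy may help picture this per-record iteration. The sketch below is not Firebolt SQL: the helper names array_filter and array_transform, the column names, and the sample data are assumptions made up for illustration. It only shows the shape of the idea, where a lambda is applied to the array stored in each record without first exploding the array into rows.

```python
def array_filter(predicate, array):
    """Keep the elements of one record's array that satisfy the predicate."""
    return [item for item in array if predicate(item)]


def array_transform(fn, array):
    """Apply fn to every element of one record's array."""
    return [fn(item) for item in array]


# Made-up sample data: each record stores its arrays in place.
prescriptions = [
    {"patient_id": 1, "drug_codes": ["A10", "C09", "A11"], "doses_mg": [500, 10, 850]},
    {"patient_id": 2, "drug_codes": ["C09"], "doses_mg": [20]},
]

# The lambdas run once per record, against that record's arrays.
result = [
    {
        "patient_id": row["patient_id"],
        "a_codes": array_filter(lambda code: code.startswith("A"), row["drug_codes"]),
        "doses_ug": array_transform(lambda mg: mg * 1000, row["doses_mg"]),
    }
    for row in prescriptions
]
print(result)
```

The output still has one row per patient, which is the point of the approach: the arrays are read and processed in place rather than being unpacked into many rows and regrouped afterwards.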

Jeremy: Matan, I think this is really important, especially when we look across the data we have within Life Science, which is typically quite diverse. We've mentioned some features here around joins and aggregates. These are really, really important as we look at complex KPIs. And we maximize the speed of those joins. And this feature around aggregating indexes - very, very unique, and something we've actually built in to support our analytical APIs.

Similarly on the semi-structured side, you can store less data because it's not exploded by many-to-many relationships. This cheapens your storage, and at the same time, you can get close to that same performance through reads. Very interesting set of features.

Controlling costs

Matan: The last topic I want to discuss is how Firebolt enables users, and IQVIA in particular, to control and reduce the cost of their data warehouse. We actually do that on two different fronts.

The first is compute. Firebolt allows you to easily align the engine to designated tasks. If there’s a large ingestion, you could assign a larger engine to it, both in terms of the engine spec and the number of nodes within the engine. If there's a need to scale the engine for any other task, by either adding or removing nodes, this can be done very easily. But we're not only enabling users to tweak engines in terms of the number of nodes, but also in terms of the specific spec of the engine. For example, you can decide on RAM, CPU, and disk, which are all important components at different stages of query execution.

The second front is storage. Firebolt storage is compressed and stored on AWS S3, and the price of storage is the list price of S3, which is $23 per terabyte per month. That usually means storage is not a significant cost, so you can store very large amounts of data within the data warehouse while keeping the flexibility to use only the compute resources that are actually required to query the data efficiently.
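To put the quoted storage figure in perspective, here is a back-of-the-envelope calculation in Python. The $23 per terabyte per month price comes from the discussion above. The data volume and the compression ratio are assumptions chosen purely for illustration, and the function name monthly_storage_cost is hypothetical.

```python
S3_PRICE_PER_TB_MONTH = 23.0  # USD, the figure quoted in the webinar


def monthly_storage_cost(raw_tb, compression_ratio=1.0):
    """Estimated monthly storage cost in USD for raw_tb terabytes of raw data.

    compression_ratio is raw size divided by stored size (assumed value).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    stored_tb = raw_tb / compression_ratio
    return stored_tb * S3_PRICE_PER_TB_MONTH


# 100 TB of raw data, first uncompressed, then with an assumed 4x compression.
uncompressed = monthly_storage_cost(100)
compressed = monthly_storage_cost(100, compression_ratio=4)
print(f"${uncompressed:,.0f} vs ${compressed:,.0f} per month")
```

Under these assumptions, 100 TB costs $2,300 per month uncompressed and $575 per month at 4x compression. Actual compression ratios depend on the data.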

To recap - we talked about how speed of insight is critical, and how IQVIA and Firebolt are working together to get all those little incremental gains that will really give you a competitive advantage. This approach of the analytical-ready APIs means you don't have to keep reinventing the wheel, but you can plug the next big thing in and accelerate your adoption of these new technologies.
