What is a data lakehouse
A data lakehouse is the combination of a data lake with an analytics engine. It provides direct access to a data lake using SQL for any use ranging from raw data access to analytics. You may hear about a lakehouse from three different places
- A data warehouse vendor, usually with decoupled storage-compute and “data-lake scale” who says you can implement a lakehouse.
- A vendor with a federated query engine and data lake-related functionality who says you should implement a lakehouse.
- A customer who implements their own lakehouse by combining a storage-only data lake with a federated query engine.
A data lakehouse is useful for when you need to perform any data access, or any analytics against any data, and you do not know the data or the analytics in advice. If performance is not a major concern, then a lakehouse architecture will work really well.
Data lakes have become an important part of any (batch and streaming) data pipeline. A federated query engine is a great addition for improving access to the data lake because it adds SQL support. In other words, if you have a data lake, you will probably end up with a lakehouse.
That does not mean you should build everything on a lakehouse. You can read more here on how to decide between a data lake and lakehouse, a data warehouse, and a specialized analytics database for each use case.
Challenges
A data lakehouse combines the raw data of a data lake with the flexibility of a federated query engine. But that comes with a price. For example, if you try to use a data lake with a federated query engine, you might face the following challenges:
- All your data is a lot of data. Unless you are a data domain expert, it can be hard to understand which data to use.
- You may need to do a lot of processing, including aggregations, joins, and data cleansing, to make the data usable for any given analytics.
- A federated query engine operating against raw data will never be as fast as a data warehouse or specialized data engine.
The performance issue is a big challenge. Even if you recreate your data warehouse schema in a data lake, creating and maintaining aggregations or materialized views in your data lake, they will not be as fast as a data warehouse that optimizes data during ingestion to improve performance.
For all of these reasons, the broader base of analysts, employees and customers will generally not directly access a data lakehouse. Only data engineers and analytics engineers who support larger groups of analytics users will use it.
If you use a data warehouse as a data lakehouse, it can introduce its own set of challenges:
- Most cloud data warehouses do not support all types of semi-structured data well, or deliver much better performance than a good federated query engine.
- Data warehouses do not store unstructured data very well either.
- The compute engines in a cloud data warehouse are the only way to access data. This can make a data warehouse very expensive as a lakehouse. This cost and SQL-only access can limit data usage.
Benefits
A data lakehouse does make it easier to access raw data using SQL. If the data lake team hosts a federated query engine alongside their data lake, it will help reduce duplicate query engines getting deployed and also reduce the amount of data errors and data duplication as well. A centralized team of experts that also owns data ingestion and data lake formats is able to achieve faster query performance as well.
Firebolt vs a data lakehouse
Many companies have shifted their efforts, including their data integration teams, towards building a data lake as the single source of the raw truth. Firebolt is a high-performance data warehouse designed for use with data lakes. It includes the ability to directly queries against data lakes to ingest data. The most common and easiest way to load data into Firebolt is from a data lake.
When using Firebolt, a data engineer can write SQL to ingest directly from files in a data lake (in S3) and run the SQL in any Firebolt engine. Firebolt makes this possible by exposing files within SQL as tables.