What is a data lake?
A data lake is a store of raw data. Think of it as a logical data store that can be implemented in many different ways. It is just a data store, not compute; that distinction is important.
The term was first defined by James Dixon:
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data.”
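To make “native format” concrete, here is a minimal sketch of landing raw files in an object store with boto3. The bucket name, key prefixes, and file names are hypothetical, and any S3-compatible store would work just as well.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # hypothetical bucket name

# Structured: a CSV export from an operational database, stored as-is.
s3.upload_file("orders_2024-06-01.csv", bucket, "raw/orders/2024-06-01/orders.csv")

# Semi-structured: JSON click events from an application, untouched.
s3.upload_file("events_2024-06-01.json", bucket, "raw/clickstream/2024-06-01/events.json")

# Unstructured: PDFs, images, logs; no schema imposed at write time.
s3.upload_file("invoice_1234.pdf", bucket, "raw/invoices/invoice_1234.pdf")
```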
Data lakes today are a mandatory part of any (batch and streaming) data pipeline. They will also change over time, especially as your architecture matures and you add governance. Just compare the raw data lake to Alex Gorelik’s explanation of a (fully governed) data lake. Make sure your pipeline allows you to change your data lake.
Challenges
Data lakes have gone through three phases:
- Using a data warehouse as a data lake, including modern cloud data warehouses
- Trying Hadoop (this is declining in use)
- Creating a modern data lake
Each phase comes with its own challenges.
By 2010, data warehouses were starting to break down as the central place for all data for a host of reasons:
- Data warehouses could not store or process the massive amounts of new “Big Data”; a larger data store and staging area were needed.
- People were trying to store and manage data for other needs beyond traditional analytics, such as data science, and machine and deep learning.
- BI users were looking to get around the data warehouse because they needed to access new data and build new reports faster.
Over the last decade, people turned to Hadoop.
What started with Hadoop evolved into data lakes.
If you build your data lake on any compute engine, such as Hadoop, a data warehouse, or even a federated query engine like Presto, you immediately limit both your access to the data and the types of computing you can perform. The same is true for Kafka (Confluent) or Spark (Databricks). Do not couple your data lake to any compute.
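As an illustration of that decoupling, here is a minimal sketch (not a prescribed implementation) in which two independent engines, DuckDB and Spark, read the same Parquet files straight off the object store. The bucket path and the order_date/amount columns are hypothetical, and object-store credentials are assumed to be configured.

```python
import duckdb
from pyspark.sql import SparkSession

path = "s3://my-data-lake/raw/orders/*.parquet"  # hypothetical lake location

# Engine 1: DuckDB queries the Parquet files in place on the object store.
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
daily_totals = con.execute(
    f"SELECT order_date, SUM(amount) AS total FROM read_parquet('{path}') GROUP BY order_date"
).fetchall()

# Engine 2: Spark reads the very same files; no copy, no engine lock-in.
spark = SparkSession.builder.appName("lake-access").getOrCreate()
orders = spark.read.parquet("s3a://my-data-lake/raw/orders/")
orders.groupBy("order_date").sum("amount").show()

# Swapping either engine out (Presto, Trino, pandas, ...) changes nothing
# about how the data is stored.
```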