April 9, 2025

Revolutionizing Data Governance with DataStrato’s Unified Open Source Approach

In this episode of The Data Engineering Show, the bros sit down with Lisa Cao, Product Manager at DataStrato, to explore data catalogs and Apache Gravitino, a unified metadata lake used to manage access and data governance across all data sources. They also discuss how data catalogs refine the data management process.

Listen on Spotify or Apple Podcasts

Episode Highlights

What is Apache Gravitino? (01:24)

Apache Gravitino is a meta-catalog that serves as a unified data governance and security layer for managing different data systems. Lisa shares that Gravitino was the first to release an Iceberg REST catalog and ended up open-sourcing it for the general community to use. As time passed, Polaris and Unity Catalog were also announced as open source. She highlights that although Gravitino, Polaris, and Unity Catalog are very similar, Gravitino differs in that it can support multiple catalogs.

Unifying AI/ML and Big Data Stack (03:15)

One of the interesting things about Gravitino is that it offers more than just a data catalog: it also provides model catalogs, which are a first step toward merging the two worlds of AI/ML and big data. Lisa describes the goal as effective model management, that is, a system that can store and manage different types of models, track changes to them, and control access to them.

Simplifying Data Governance (10:49)

Think of Gravitino as a “traffic cop” that helps manage and secure data from multiple sources. A system that provides unified access control across all data sources is crucial: it lets teams manage access and data governance in one place so that ML teams don’t have to worry about access. Lisa says that Apache Gravitino makes data accessible to different teams and users while ensuring that it is secured and governed appropriately.

Gravitino’s Query Engine Solution (21:34)

Every query engine has its own way of managing data, which makes it difficult to switch between engines because you have to reconfigure everything. Lisa highlights that Gravitino solves this problem by providing a single layer of data governance that works across multiple query engines.

Navigating the Fast-Paced World of Data Engineering (24:41)

Lisa talks about how fast the data engineering space is moving and shares some tips for keeping up:

  • Don’t try to learn everything at once.
  • Don’t get too deep into every tool.
  • Look for real-world adoption.

She warns against social media hype, which can amplify the messaging around new tools and make it seem as though everyone is using them when, in reality, actual adoption is hard to see.
