What Is Data Mesh?
In 2019, Zhamak Dehghani introduced an organizational, architectural, and technology-agnostic framework for big data, molded by the principles of mesh networking. This framework became known as the data mesh, a model based on the concept of decentralization.
Taken literally, a data mesh is a dense network of nodes containing data. Unlike the centralized, monolithic architectures of data warehouses and data lakes, a data mesh is highly decentralized.
The use of centralized data silos, aimed at creating a single source of truth (SSOT) and usually populated through extract-transform-load (ETL) processes, has historically led to significant challenges, such as bootstrapping the monolith and scaling it in terms of both the number of data consumers and the computational resources required. In addition, users of centralized monoliths typically struggle to find quality data and to interpret it correctly because of a lack of clear data ownership.
The first and most important principle of the data mesh is domain-oriented decentralized data ownership and architecture. It addresses a specific problem: moving the analysis of data into the same domain where the data is produced. In this approach, the experts of the domain from which the data originates are also responsible for its quality and interpretation, and thanks to their domain experience, problems with the interpretation and quality of the data naturally decrease.
The data mesh applies domain-driven design (DDD) to data architectures. In software architecture, the best-known application of DDD is microservices. Just as microservices are software components that expose discrete pieces of application functionality, data products are software components that expose discrete units of data and analytical functionality.
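To make the analogy concrete, here is a minimal sketch of a data product modeled as a software component with explicitly published output ports. All names in it (DataProduct, register_output_port, the sales example) are illustrative assumptions for this sketch, not part of any standard data mesh API.

```python
# A minimal, illustrative sketch of a "data product" as a software component.
# Like a microservice that exposes application functionality through an API,
# the data product exposes data and analytics only through published ports.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataProduct:
    """A domain-owned unit that serves data and analytics to consumers."""
    name: str
    domain: str  # the business domain that owns and curates the data
    output_ports: Dict[str, Callable[[], List[dict]]] = field(default_factory=dict)

    def register_output_port(self, port_name: str, reader: Callable[[], List[dict]]) -> None:
        """Expose a dataset or analytical view through a named output port."""
        self.output_ports[port_name] = reader

    def read(self, port_name: str) -> List[dict]:
        """Consumers access data only through published ports, never raw storage."""
        return self.output_ports[port_name]()


# Example: a hypothetical "orders" data product owned by the sales domain
# publishes a daily-revenue analytical view.
orders = DataProduct(name="orders", domain="sales")
orders.register_output_port(
    "daily_revenue",
    lambda: [{"date": "2023-01-01", "revenue": 1250.0}],  # stand-in for a real query
)
print(orders.read("daily_revenue"))
```

The design choice of exposing data only through named ports mirrors how a microservice exposes functionality only through its API, keeping the underlying storage an implementation detail of the owning domain.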
As with microservices architectures, the disaggregated model of the data mesh requires a set of rules and supporting tools. In particular, data products must comply with the properties summarized by the DATSIS acronym: a data product must be discoverable, addressable, trustworthy, self-describing, interoperable, and secure.
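One way to picture the DATSIS properties is as metadata that travels with every data product. The descriptor below is a hedged illustration under that assumption; the class and field names are invented for the example and do not follow any formal specification.

```python
# An illustrative (non-standard) descriptor mapping each DATSIS property
# to a concrete metadata field attached to a data product.

from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataProductDescriptor:
    name: str              # discoverable: listed in a catalog under this name
    address: str           # addressable: a stable, unique endpoint or URI
    quality_slo: str       # trustworthy: declared quality and freshness guarantees
    schema: Dict[str, str] # self-describing: published schema and semantics
    formats: List[str]     # interoperable: standard formats and protocols
    access_policy: str     # secure: who may read it and how access is granted


# Hypothetical descriptor for the daily-revenue view of the sales domain.
daily_revenue = DataProductDescriptor(
    name="sales.daily_revenue",
    address="https://data.example.com/sales/daily_revenue",  # assumed URI
    quality_slo="refreshed daily by 06:00 UTC, < 0.1% missing rows",
    schema={"date": "DATE", "revenue": "DECIMAL(12,2)"},
    formats=["parquet", "json"],
    access_policy="role:analyst via OAuth2",
)
print(daily_revenue.name, daily_revenue.address)
```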
Moreover, in adopting a data mesh architecture, an organization risks an increase in the heterogeneity of its technologies, which can make the sourcing strategy very complex. It is therefore important for the company to equip itself with a data infrastructure platform, such as a PaaS cloud platform, that standardizes data product development and introduces a common language across the various domains.
Furthermore, it is very important that the company equips itself with ecosystem governance tools that make data products visible and understandable. In these tools, each data product can be registered, making it searchable and reusable according to specific authentication and authorization policies.
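The sketch below makes this governance idea concrete: a hypothetical catalog in which domain teams register their data products, other teams search for them, and access is checked against declared roles. The Catalog class and its methods are assumptions for illustration only, not a real governance product.

```python
# A sketch of ecosystem governance: a central catalog where each domain
# registers its data products so they become searchable and reusable,
# with access mediated by simple role-based authorization policies.

from typing import Dict, List, Set


class Catalog:
    def __init__(self) -> None:
        self._products: Dict[str, Dict] = {}
        self._grants: Dict[str, Set[str]] = {}  # product name -> allowed roles

    def register(self, name: str, owner_domain: str,
                 tags: List[str], allowed_roles: Set[str]) -> None:
        """A domain team publishes its data product and declares who may use it."""
        self._products[name] = {"owner": owner_domain, "tags": tags}
        self._grants[name] = allowed_roles

    def search(self, tag: str) -> List[str]:
        """Discoverability: any team can find registered products by tag."""
        return [n for n, meta in self._products.items() if tag in meta["tags"]]

    def authorize(self, name: str, role: str) -> bool:
        """Authorization check performed before a consumer reads the product."""
        return role in self._grants.get(name, set())


catalog = Catalog()
catalog.register("sales.daily_revenue", "sales", ["revenue", "finance"], {"analyst"})
print(catalog.search("finance"))                            # ['sales.daily_revenue']
print(catalog.authorize("sales.daily_revenue", "analyst"))  # True
```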
The data mesh is not a strict technical implementation. It is a transformation of data software architecture that promotes moving from a monolithic data lake to a distributed, data-driven architecture. Its features include moving from centralized to decentralized data ownership, shifting from treating pipelines as the first concern to treating domain data as the first concern, replacing siloed data engineering teams with cross-functional domain data teams, and focusing on data as a product instead of on a centralized data lake or data warehouse.
People sometimes confuse a data mesh with a data fabric. The difference between the two is that a data fabric takes a technology-centric, architectural approach to data access and integration, whereas a data mesh is primarily an organizational approach that connects domain teams, users, and data processes.
Advantages of Data Mesh Implementation
Let’s now outline the principal advantages of implementing a data mesh architecture.
- Agility: Because a data mesh is a distributed architecture, it enables decentralized data operations and improves team performance. This, in turn, improves time to market, scalability, and agility thanks to reduced complexity and smaller IT backlogs.
- Resilience to Technological Progress: The data mesh paradigm offers strong protection against the risk of technological obsolescence. Because data products are decoupled, each one can adopt new technologies as they emerge without forcing changes on the others.
- Speed: Thanks to its well-governed, decentralized data storage, a data mesh provides a simple, self-service infrastructure for faster and more secure access to data.
- Decoupling: As with microservices, a data mesh brings the advantages of a decoupled infrastructure: easier scalability, technological independence, and organizational independence.
- Flexibility: A data mesh gives enterprise organizations the flexibility to become more vendor agnostic. Additionally, because of the inherently decentralized nature of a data mesh, the individual domains become responsible for data quality, security, and transfers. This connectivity enables all kinds of users to access data regardless of their physical location, easing the problems of traditional data sharing, which is subject to international regulations.
- Compliance: The distributed architecture keeps data ingestion close to its sources and formats, allowing businesses to control security at the source and helping them build a compliant data platform.
- Access: A data mesh prevents the creation of data silos, improving data access and transparency for cross-functional teams. Its unified infrastructure enables business domains to share data products efficiently while enforcing common standards.
- Data Governance: The distributed architecture of the data mesh allows organizations to control security at the source, simplifying compliance with global data governance policies.
Challenges of Data Mesh Implementation
Adopting the data mesh framework comes with several challenges, caused primarily by the heterogeneity of the software systems that compose the nodes of the network.
- Specialization: Implementing a data mesh requires specialists who can build domain-specific ETL pipelines, data lake implementations, and other complex data systems.
- Data Redundancy: A data mesh can make data governance harder to manage because of its multi-cloud and hybrid infrastructure. Redundancy occurs when one domain's data is copied and repurposed to serve the business needs of another domain, which can increase resource usage and data management costs.
- Adoption Costs: Decentralizing data management to adopt a data mesh requires major changes when moving from a highly centralized data architecture. The ecosystem governance tools and data infrastructure platform needed to maintain a good-quality data mesh solution add bootstrapping and maintenance costs.
- Complexity: Enterprise-wide data models must be defined to merge various data products and make them available to authorized users in a central location.