Understanding Data Mesh, Data Fabric, Data Hub, and Data Architecture Design
Integrating and using data across an organization involves not only the data itself but also users’ requirements, architecture design, integration tools, and the domain knowledge needed to select, ingest, process, and analyze the right data and draw insights from it. This article defines several architecture methodologies and offers a perspective on how they fit together.
Data Mesh
Zhamak Dehghani, CEO and founder of Nextdata, conceptualized the data mesh. Nextdata aims to decentralize data at scale through a data-mesh-native toolset. A data mesh organizes and analyzes data according to domain knowledge and offers a shared data platform for interoperability, discovery, and governance, delivered through containers and APIs. This resembles platforms that offer data lineage, model development, governance controls, and data sharing mechanisms for use by industry domain experts or business users. Data is treated as a decentralized, autonomous product with domain ownership and governance, supported by shared services and infrastructure for collaboration and data sharing. Examples of vendors include Thoughtworks, StreamSets, WSO2, and Upsolver.
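To make the idea of data as a domain-owned product on a shared platform more concrete, here is a minimal sketch in Python. It is not taken from any vendor's toolset: the `DataProduct` and `MeshCatalog` classes, their fields, and the sample domains are all hypothetical names chosen for illustration. It shows domains registering their own products while a shared catalog handles discovery only.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A domain-owned dataset published as a product (hypothetical model)."""
    name: str
    domain: str   # owning business domain, e.g. "claims"
    owner: str    # team accountable for quality and governance
    schema: dict  # column name -> type, the published contract
    tags: list = field(default_factory=list)


class MeshCatalog:
    """Shared platform service (hypothetical): registration and discovery only.

    The data itself stays with each domain; the catalog holds metadata.
    """

    def __init__(self):
        self._products = {}

    def register(self, product):
        key = (product.domain, product.name)
        if key in self._products:
            raise ValueError(f"{product.domain}/{product.name} is already registered")
        self._products[key] = product

    def discover(self, tag):
        """Return all products, across domains, that carry the given tag."""
        return [p for p in self._products.values() if tag in p.tags]


catalog = MeshCatalog()
catalog.register(DataProduct("paid_claims", "claims", "claims-data-team",
                             {"claim_id": "str", "amount": "float"}, ["finance"]))
catalog.register(DataProduct("policy_premiums", "underwriting", "uw-analytics",
                             {"policy_id": "str", "premium": "float"}, ["finance"]))

for product in catalog.discover("finance"):
    print(product.domain, product.name, product.owner)
```

The design choice to notice is that ownership lives on the product, not on the catalog: the shared service knows who is accountable for each dataset but does not hold or transform the data.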
Data Fabric
Data fabric organizes data as a distributed, network-based architecture that provides an integrated data layer (the fabric) over shared components, from data sources to repositories and analytics in the cloud. It is a layer of abstraction over data components that gives business users visibility into the data processing workflow and data science initiatives, alongside the technical or development teams. In short, it provides a uniform view and use of data across the organization. Examples of data fabric are tools and services that provide data processing, data exploration, data integration, data governance, and analytical model development, with prebuilt and preconfigured components, across a common data model and metadata repository.
There are different interpretations of data fabric. It can be connected via a knowledge graph (I have covered this topic in this Celent report on graph theory). It can also be understood as a “catch-all” layer that includes data processes and data management, connected through APIs and ETL pipelines. Some illustrations of data fabric are listed below. Vendors include Cloudera, Denodo, HPE Ezmeral, IBM, Informatica, and Talend.
Source: Enterprise Knowledge
Source: Infoworks and Eckerson
Data Hub
A data hub is a center of data exchange supported by data science, data engineering, and data warehouse technologies, interacting with endpoints such as applications, algorithms, and the users themselves. Seamless data flow is at its core. It is an approach for determining when, where, and for whom data is used, and for bringing together enterprise data from different sources and formats. It centralizes and standardizes data to make it easy for users to access. Data hub examples include Amazon Redshift and Glue, MongoDB Atlas, SAP Data Hub, and Oracle Data Hub.
Source: AWS
Data Warehouse and Data Lake
A data warehouse and a data lake are endpoints for data collection and processing, with support for analytics. A data warehouse is a set of structured data with a defined schema. It can be derived from a data lake, which has a flexible schema that can accommodate different types of data. A data lake holds raw and unprocessed data, including structured, semi-structured, and unstructured data.
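The contrast between a flexible raw store and a defined schema can be sketched in a few lines of Python. This is an illustration only, not a description of any particular product: the sample records, the `WAREHOUSE_SCHEMA` mapping, and the `to_warehouse_row` helper are hypothetical. Raw records are kept as ingested, and only those that can satisfy the fixed schema are loaded into the structured set.

```python
import json

# Raw records as ingested: shapes and types vary from record to record.
raw_lake = [
    '{"policy_id": "P-1", "premium": "1200.50", "notes": "renewal"}',
    '{"policy_id": "P-2", "premium": 980}',
    '{"policy_id": "P-3"}',
]

# Warehouse schema (hypothetical): fixed columns and types, enforced on load.
WAREHOUSE_SCHEMA = {"policy_id": str, "premium": float}


def to_warehouse_row(raw_record):
    """Parse one raw record and coerce it to the schema, or return None."""
    record = json.loads(raw_record)
    row = {}
    for column, column_type in WAREHOUSE_SCHEMA.items():
        if column not in record:
            return None  # reject rows that cannot satisfy the schema
        row[column] = column_type(record[column])
    return row


warehouse = [row for row in map(to_warehouse_row, raw_lake) if row is not None]
print(warehouse)
```

Extra fields such as `notes` remain available in the raw store but are dropped on load, and the record without a premium is excluded from the structured set.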
How Do These Terms Relate?
In essence, these data architecture concepts describe ways of combining various data management tools into a data workflow. Data mesh emphasizes a data workflow built on industry domain knowledge and data decentralization. Data fabric describes a layer that combines data pipelining, orchestration, and cataloging, with an emphasis on governance and business users. Data hub is a data exchange that focuses on data flow among users, applications, and algorithms.
We can understand these architecture designs as belonging to the broader topic of data management (including metadata, DataOps, MLOps, and ModelOps) and to the concept of data integration. To provide a coherent explanation for designing a data architecture blueprint, that topic has now expanded to include other architecture methodologies and conceptual terms such as data mesh, data fabric, and data hub, as well as the concept of connecting data from different sources through the data warehouse (which can draw on the raw data in the data lake).
When designing a data architecture that fits an organization, we should consider:
1) Data requirements (what type of data to collect and how it will be used, and the requirements of the organization and the systems available to process and analyze the data)
2) Data quality (the expected maintenance process to ensure data accuracy, completeness, consistency, and timeliness)
3) Data governance (how data will be managed and controlled, including roles and responsibilities for managing data, with policies and procedures to ensure compliance with regulations)
4) Data security (the protection of data from unauthorized access and disclosure)
5) Data scalability (an architecture that is flexible enough for future growth and change)
6) Data integration (integrating data across users, applications, systems, and sources, and ensuring interoperability and data sharing)
7) Data performance (how data can be processed and accessed when needed, and how large volumes are handled)
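Some of these considerations can be turned into simple automated checks. As one illustration, the sketch below measures two of the data quality dimensions named in the list, completeness and timeliness. The function names, the sample records, and the one-day freshness threshold are assumptions made for this example, not a standard.

```python
from datetime import datetime, timedelta, timezone


def completeness(records, required_fields):
    """Share of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    return complete / len(records)


def is_timely(last_updated, max_age, now):
    """True if the data was refreshed within the allowed age."""
    return now - last_updated <= max_age


# Hypothetical sample: two of the four records are missing a required value.
records = [
    {"claim_id": "C-1", "amount": 100.0},
    {"claim_id": "C-2", "amount": None},
    {"claim_id": "", "amount": 50.0},
    {"claim_id": "C-4", "amount": 75.0},
]
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
score = completeness(records, ["claim_id", "amount"])
fresh = is_timely(datetime(2024, 1, 1, 12, tzinfo=timezone.utc), timedelta(days=1), now)
print(score, fresh)
```

In practice, checks like these would run on a schedule and feed the governance process, so that thresholds are agreed with the data owners rather than hard-coded.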
Therefore, we need to select an architecture design in accordance with the organization’s requirements and use these concepts as a guide in developing a design that works for the organization. It could, for example, combine a data fabric for a unified data view with the domain knowledge and autonomy of the data mesh perspective, executed through the standardization of a data hub design.
_____________________________________________________________________________
To learn more, Celent tracks this market and has research addressing it (list of recent reports here). If you would like to find out more, please feel free to get in touch with me.
Below are related reports contributed by Celent on this topic:
Introduction to Graph Data Design: Alternative Database and Tools
Securing Insurance Data: Confidential Computing and Data Lineage Use Case
Business Data Strategy: Underwriting and Actuary Case Studies
The Data Force: Cultivating a Data-ready Organization
Data Management for Insurance: Best Practice and Solutions
MLOps Part 2: Examples of Enterprise Machine Learning Deployment Providers
MLOps Part 1: From Machine Learning Innovation to Production