As the data mesh architecture approach we use to build the Data Commons enables decentralisation and organization of data along domain-driven lines, we also need to define common data governance guidelines, standards and controls that local domain teams of contributors can follow as part of their implementation. These cover the data management processes and practices (data schema management, metadata management, data lineage management), which are mainly governed through shared documentation and community code reviews, as well as the shared data infrastructure and services layer that domains can leverage to build their own pipelines from pre-approved templates and guidelines that ensure security and compliance. In this section, we focus on the architecture drivers related to the federated governance process as well as the design of the shared platform supporting it.
- FG-001 - Shared model for federated governance: We aim to give domain product owners a high level of independence and accountability, from production to consumption, over how they manage their data and how they can best scale. Rather than enforcing a command-and-control centralized governance function without consideration for the nuances of each domain, the data mesh paradigm requires a federated governance model where responsibility for governance is shared. In practice, this means:
- We centralize governance of global risk and compliance policies, the standards of the common technology platform (including data security), inter-domain data and data pipeline management standards, and data lineage reporting and auditing. On the security implementation side, identity management (authentication) is also managed centrally, while access management is provided centrally and delegated to the respective domains.
- The domains therefore directly own data provenance, data quality (both definition and measurement), data classification (in the form of a data dictionary and data set metadata communicated to a cross-domain data catalog), authorization entitlements (via role-based access control management), adherence to compliance and terminology standards, and the definition of inter-domain data entity standardization.
This means that some governance aspects are set at the data mesh level whereas others are managed at the discretion of the domain, within common standards and best practices. We elaborate further on each specific governance aspect in the following list of drivers.
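To make this split concrete, the sketch below models the division of decision rights as a small, machine-readable registry that a platform could validate against. This is a minimal illustration only; the aspect names, scopes and standard identifiers are hypothetical placeholders, not part of our implementation.

```python
from dataclasses import dataclass
from enum import Enum

class GovernanceScope(Enum):
    """Where the decision rights for a governance aspect sit."""
    CENTRAL = "central"   # owned by the mesh-level governance function
    DOMAIN = "domain"     # delegated to the domain product owner

@dataclass(frozen=True)
class GovernanceAspect:
    name: str
    scope: GovernanceScope
    standard: str  # reference to the shared standard or guideline document

# Illustrative registry mirroring the split described above; all names
# and standard references are assumptions for the sake of the example.
GOVERNANCE_REGISTRY = [
    GovernanceAspect("risk_and_compliance_policies", GovernanceScope.CENTRAL, "STD-SEC-001"),
    GovernanceAspect("identity_management", GovernanceScope.CENTRAL, "STD-IAM-001"),
    GovernanceAspect("data_lineage_reporting", GovernanceScope.CENTRAL, "STD-LIN-001"),
    GovernanceAspect("data_quality", GovernanceScope.DOMAIN, "STD-DQ-001"),
    GovernanceAspect("data_classification", GovernanceScope.DOMAIN, "STD-CAT-001"),
    GovernanceAspect("access_entitlements", GovernanceScope.DOMAIN, "STD-RBAC-001"),
]
```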
- FG-002 - Shared built-in delegated authorization system for unified security management: Data platforms and their ELT pipelines by definition increase data sprawl across the organization, because data must be accessed and replicated and/or derived data is generated across pipelines. We use a shared delegated authorization system built into the single data access layer (at the level of the Distributed SQL Query Engine) to minimize the need for data duplication, thereby reducing data sprawl, while also ensuring consistency in authorization and entitlements across the organization. This model of federated access is represented in the diagram below.
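As a rough illustration of how such a check could sit in front of the query engine, the sketch below combines centrally resolved authentication (which roles a principal holds) with entitlement decisions delegated to each domain's RBAC configuration. Class, role and dataset names are hypothetical assumptions, not our actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class DomainEntitlements:
    """RBAC entitlements owned and maintained by one domain."""
    domain: str
    role_grants: dict[str, set[str]] = field(default_factory=dict)  # role -> readable datasets

class DelegatedAuthorizer:
    """Sketch of a shared authorization check in the single data access
    layer: identity is established centrally, entitlements per domain."""

    def __init__(self, entitlements_by_domain: dict[str, DomainEntitlements]):
        self._entitlements = entitlements_by_domain

    def can_read(self, principal_roles: set[str], domain: str, dataset: str) -> bool:
        domain_rbac = self._entitlements.get(domain)
        if domain_rbac is None:
            return False  # unknown domain: deny by default
        return any(
            dataset in domain_rbac.role_grants.get(role, set())
            for role in principal_roles
        )

# Usage: the query layer consults the authorizer before executing, so
# data is read in place rather than copied per consumer.
authz = DelegatedAuthorizer({
    "sales": DomainEntitlements("sales", {"sales_analyst": {"orders_v2"}}),
})
assert authz.can_read({"sales_analyst"}, "sales", "orders_v2")
```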
- FG-003 - Data compliance through automated and centralized data lineage management: Data governance management is the ability to assess and monitor whether the data follows all required policies (e.g., GDPR, SOX). A critical dependency for this is the ability to capture and store the origin of the data, what happens to it in data pipelines over time (and why), and to trace it all the way to its distribution: in other words, the ability to produce and maintain a data lineage. In the context of our data-as-code approach, we therefore ensure compliance through an automated and centralized data lineage management capability, closely integrated with the authorization system, which provides an immutable record of all data transactions and activities through the following capabilities (a minimal sketch of such a lineage record follows the list):
- Every version of data pipeline code, models (if relevant) and data is captured and tracked, using automated data versioning that provides a complete audit trail for all data and artifacts across pipeline stages, including intermediate results.
- The platform maintains historical reproducibility of data and code within the time period mandated by compliance requirements.
- The platform manages the relationships between all historical data states (the data lineage proper). This includes capturing and storing key metadata attributes of each pipeline execution, such as the source data systems involved, the rules or models used to process the data, time stamps for each state as data is created, added, processed or deleted, and organizational information such as domain owner, data format and documentation, and retention policies.
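The sketch below shows what one append-only lineage entry could look like, carrying the metadata attributes listed above. The field names and the example values are illustrative assumptions; the real attribute set is defined by the shared lineage standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: each record is immutable once written
class LineageRecord:
    """One append-only entry in the lineage log for a pipeline stage."""
    dataset: str
    version: str                    # content-addressed or sequential data version
    pipeline_code_ref: str          # e.g. git commit of the pipeline code
    source_systems: tuple[str, ...] # source data systems involved
    transformation: str             # rule or model applied at this stage
    event: str                      # created | added | processed | deleted
    timestamp: datetime
    domain_owner: str
    data_format: str
    retention_policy: str

def record_stage(log: list[LineageRecord], **attrs) -> None:
    """Append-only write; existing records are never mutated, which is
    what makes the log usable as an audit trail."""
    log.append(LineageRecord(timestamp=datetime.now(timezone.utc), **attrs))

# Hypothetical usage within a pipeline stage:
log: list[LineageRecord] = []
record_stage(
    log,
    dataset="orders", version="v42", pipeline_code_ref="git:3f9c2a1",
    source_systems=("crm", "billing"), transformation="dedupe_rule_v3",
    event="processed", domain_owner="sales", data_format="parquet",
    retention_policy="7y",
)
```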
- FG-004 - Data quality is quantified and communicated to data users: The domain owner is responsible for establishing a data quality assessment framework that is implemented in the pipelines. This requires the ability to measure and report on quality, and to manage distribution based on clearly established quality gates implemented as part of data testing.
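A minimal sketch of such a quality gate is shown below: each gate compares an observed metric against a threshold, the resulting report is what gets communicated to data users, and distribution is blocked unless every gate passes. The gate names and thresholds are hypothetical examples, not prescribed values.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

@dataclass
class QualityGate:
    """A named metric and the threshold it must meet before the data
    set may be distributed to consumers."""
    metric: str
    threshold: float
    passes: Callable[[float, float], bool]  # (observed, threshold) -> bool

def evaluate_gates(metrics: Mapping[str, float], gates: list[QualityGate]) -> dict:
    """Builds the quality report published alongside the data product."""
    results = {g.metric: g.passes(metrics[g.metric], g.threshold) for g in gates}
    return {"metrics": dict(metrics), "gates": results, "passed": all(results.values())}

# Hypothetical gates for an orders data product: completeness and freshness.
gates = [
    QualityGate("null_rate", 0.01, lambda v, t: v <= t),
    QualityGate("freshness_hours", 24, lambda v, t: v <= t),
]
report = evaluate_gates({"null_rate": 0.002, "freshness_hours": 6}, gates)
assert report["passed"]  # only then is the data set released downstream
```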