# ingestion
w
Hello. We are planning to evaluate DataHub entities (mainly datasets) against company compliance rules, with the results reported back to the dataset owners. When I think about it, that seems very similar to the ingestion pipeline, but reversed: the source would be DataHub, the transformers would be the compliance rules, and the output a DB, console, API, etc. Has anyone else done something like this already? Would it make sense to file a feature request?
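A minimal sketch of the "reversed pipeline" shape described above: a DataHub-backed source, compliance rules as transformers, and a pluggable sink. All names here (`DatasetMeta`, `run_pipeline`, the field names) are illustrative assumptions, not a real DataHub API.

```python
# Hypothetical sketch of the reversed-ingestion idea:
# source (DataHub) -> transformers (compliance rules) -> sink (db/console/api).
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class DatasetMeta:
    """Minimal, illustrative view of a dataset's metadata (assumed shape)."""
    urn: str
    description: Optional[str] = None
    owners: List[str] = field(default_factory=list)
    schema_fields: List[str] = field(default_factory=list)


# A "rule" takes a dataset and returns a list of violation messages.
Rule = Callable[[DatasetMeta], List[str]]


def run_pipeline(
    source: Iterable[DatasetMeta],
    rules: List[Rule],
    sink: Callable[[str, List[str]], None],
) -> None:
    """Mirror an ingestion recipe: pull from the source, apply each rule,
    and hand violations to whatever sink is configured."""
    for ds in source:
        violations = [msg for rule in rules for msg in rule(ds)]
        if violations:
            sink(ds.urn, violations)


if __name__ == "__main__":
    datasets = [DatasetMeta(urn="urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)")]
    rules: List[Rule] = []  # e.g. missing-owner / missing-docs checks
    run_pipeline(datasets, rules, sink=lambda urn, v: print(urn, v))
```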
h
Interesting! What kind of rules are you thinking about?
w
Could be anything, but some basic examples:
• Missing basic documentation
• Missing owner
• Missing schema
Also, having a typed SDK for getting data out of DataHub would be great, regardless of the use case.
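For concreteness, the three example rules above could each be a small check over a metadata snapshot. The dict field names below are assumptions, not a fixed DataHub schema:

```python
# Illustrative checks for the three example rules, written against a plain
# dict "snapshot" of a dataset. Field names are hypothetical.
from typing import Dict, List


def check_dataset(snapshot: Dict) -> List[str]:
    violations = []
    if not (snapshot.get("description") or "").strip():
        violations.append("missing basic documentation")
    if not snapshot.get("owners"):
        violations.append("missing owner")
    if not snapshot.get("schema_fields"):
        violations.append("missing schema")
    return violations


print(check_dataset({"urn": "urn:li:dataset:(...)", "owners": []}))
# -> ['missing basic documentation', 'missing owner', 'missing schema']
```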
h
Ok, makes sense. We were already planning to do some of that at the ingestion stage (basically forcing the metadata to exist in the source as well), due to the lack of the feature you're describing.
w
Yea, that makes sense as an alternative.
g
@wonderful-quill-11255 Yep, there are a couple of ways I can imagine doing this:
(1) fetching data via the GraphQL interfaces (or possibly from GMS) and running checks
(2) fetching data directly from DataHub's SQL store and running scripts on that
(3) subscribing to the MAE Kafka topic and running validations as the metadata changes
On (2): this is actually what our demo is doing right now! We have a backup of the core MySQL table in S3, which is processed by an Airflow pipeline and then fed into Superset to produce charts like "dataset documentation coverage by platform", etc.
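A rough sketch of option (2), querying the backing MySQL store directly. The table and column names below (`metadata_aspect`, `urn`, `aspect`, `version`) are assumptions that vary by DataHub version (newer deployments use `metadata_aspect_v2`), so verify them against your schema:

```python
# Sketch: compute ownership coverage straight from DataHub's MySQL store.
import pymysql

conn = pymysql.connect(host="localhost", user="datahub",
                       password="datahub", database="datahub")
with conn.cursor() as cur:
    # version = 0 is the latest version of each aspect -- this is the
    # "only use the latest version" filtering discussed just below.
    cur.execute("""
        SELECT COUNT(DISTINCT urn)
        FROM metadata_aspect
        WHERE aspect = 'ownership' AND version = 0
    """)
    (entities_with_owners,) = cur.fetchone()
    cur.execute("SELECT COUNT(DISTINCT urn) FROM metadata_aspect")
    (total_entities,) = cur.fetchone()

print(f"ownership coverage: {entities_with_owners}/{total_entities}")
```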
w
Yup, but then I need logic to only use the latest version of each aspect, for example.
Using the REST APIs should solve that automatically, though it may be more cumbersome than querying the RDBMS in other ways.
g
> Yup, but then I need logic to only use the latest version of each aspect, for example.
This specific one is just a matter of adding a WHERE version = 0 clause, but using the SQL tables might be cumbersome due to the JSON nesting. I'd personally recommend using the GraphQL interfaces that the frontend provides, since they can be strongly typed and the codegen tooling around GraphQL is quite good, but ofc it depends on the exact use cases.
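A sketch of that GraphQL approach: fetch the latest metadata for one dataset and run the checks client-side. The endpoint URL, the query shape, and the auth mechanism are assumptions that depend on the DataHub version, so adjust them against the schema in GraphiQL:

```python
# Sketch: fetch a dataset via the frontend's GraphQL API and check ownership.
import requests

GRAPHQL_URL = "http://localhost:9002/api/graphql"  # assumed frontend endpoint
QUERY = """
query dataset($urn: String!) {
  dataset(urn: $urn) {
    urn
    ownership { owners { type } }
    properties { description }
    schemaMetadata { fields { fieldPath } }
  }
}
"""

urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)"
resp = requests.post(
    GRAPHQL_URL,
    json={"query": QUERY, "variables": {"urn": urn}},
    # add auth headers here if your deployment requires them
)
resp.raise_for_status()
dataset = resp.json()["data"]["dataset"]

# GraphQL already resolves the latest version of each aspect,
# so no version filtering is needed on the client.
owners = (dataset.get("ownership") or {}).get("owners") or []
print("missing owner" if not owners else f"{len(owners)} owner(s)")
```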
w
Ok. Any thoughts on whether a library or framework for this use case would be worth a feature request?
g
Yep, I definitely think so. We should probably start with a set of scripts/recipes for validating against common rules (e.g. missing owners), and then build up to a proper framework.
l
@wonderful-quill-11255 Being able to take action based on metadata changes will definitely be a big part of our H2 roadmap. There are two aspects to your requirement:
1) How to run the compliance validation checks when entities get modified (e.g. new dataset added, schema changed, new PII tag added on a column, etc.)
2) Running these checks in batch
For 1), we envision building a higher-level framework that allows you to run your logic based on higher-level events and strong types from DataHub (see the consumer sketch below).
For 2), we envision supporting export into a "metadata warehouse" so that you can run your logic there.
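A minimal sketch of the event-driven shape in point 1), in the spirit of option (3) earlier in the thread. The topic name `MetadataAuditEvent_v4` and the JSON deserialization are assumptions: real MAEs are Avro-encoded, so a production consumer would go through the schema registry instead:

```python
# Sketch: re-run compliance checks whenever metadata changes, by consuming
# the MAE Kafka topic with kafka-python.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "MetadataAuditEvent_v4",          # topic name is an assumption; check your deployment
    bootstrap_servers="localhost:9092",
    group_id="compliance-checks",
    value_deserializer=json.loads,    # placeholder; MAEs are Avro in practice
)

for message in consumer:
    event = message.value
    # MAEs carry oldSnapshot/newSnapshot; urn extraction is shown naively
    # here since the snapshot is an Avro union in reality.
    urn = (event.get("newSnapshot") or {}).get("urn", "<unknown>")
    # Re-run the rules for the changed entity, e.g. the check_dataset()
    # sketch earlier in this thread.
    print(f"metadata changed for {urn}; re-running checks")
```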
Let's set up time to chat more about your requirements
w
Sounds good. I'll PM you.