# advice-data-governance
r
Hi Community! How does DataHub support the concept of a governance status (Draft / Published)? Ideally, we would like owners to be able to curate the assets they manage before publishing them to all users. Assets still in draft mode should not be visible to everyone.
I think I may have found the feature request that corresponds to this: https://feature-requests.datahubproject.io/p/enhance-asset-life-cycle-management Surprised it hasn’t received more 👍. Curious to hear how others manage to work around this issue.
m
What would the workflow for marking a dataset as draft look like? Metadata ingestion aims to be fully automated, i.e. if a dataset is available, it will be ingested into DataHub as part of auto-discovery. One way I can think of is to have a separate environment for these datasets: their metadata can be ingested into a DEV environment in DataHub, and you can use ACLs to limit access to that environment (not sure if that is implemented yet).
r
The way I see it (and as I've seen in other products), all datasets are marked as draft by default upon ingestion. Publication is an active part of the governance process.
m
What will you do if there is a change in the schema? Does the dataset go back to draft? And why is it a draft if it is already in a production environment?
There's no functionality like the one you mentioned, but there is a workaround: you can ingest everything into a default Domain (i.e. a draft domain) and set the right ACLs on that domain. The data steward can then move the dataset to its correct domain, making it visible to the rest of the org.
I personally don't like this; metadata should be visible for people to discover. Most of the time, it does not contain sensitive information that needs to be restricted like that...
The more restrictions we apply to metadata, the less discoverable it is for data users, and the more workload it creates for data stewards. What you should aim for is having the right metadata on all datasets, and DataHub has some tools to flag missing metadata on datasets.
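For anyone who wants to try the draft-domain workaround described above, here is a minimal sketch, not a definitive setup. It patches an ingestion recipe (expressed as a Python dict) so that every dataset lands in a draft domain via the `simple_add_dataset_domain` transformer. The domain URN `urn:li:domain:draft`, the helper name `with_draft_domain`, and the Postgres/REST connection details are all assumptions for illustration.

```python
import json

# Hypothetical URN for a "Draft" domain; the domain itself must already exist in DataHub.
DRAFT_DOMAIN_URN = "urn:li:domain:draft"


def with_draft_domain(recipe: dict, domain_urn: str = DRAFT_DOMAIN_URN) -> dict:
    """Return a copy of a recipe whose datasets are all assigned to domain_urn."""
    patched = json.loads(json.dumps(recipe))  # deep copy via a JSON round-trip
    patched.setdefault("transformers", []).append(
        {
            "type": "simple_add_dataset_domain",
            "config": {"domains": [domain_urn]},
        }
    )
    return patched


# Illustrative recipe; source and sink settings are placeholders.
base_recipe = {
    "source": {
        "type": "postgres",
        "config": {"host_port": "localhost:5432", "database": "analytics"},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

draft_recipe = with_draft_domain(base_recipe)
print(json.dumps(draft_recipe, indent=2))
```

The same structure can be written as a YAML recipe for the `datahub ingest` CLI. One thing worth verifying in the transformer docs before relying on this: depending on the transformer's semantics setting, a later re-ingestion may put a dataset back into the draft domain after a steward has moved it.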
r
This was also our thinking initially but the feedback we’ve got is that exposing assets that aren’t documented or coming from “internal” schemas (e.g. staging models) creates more confusion and isn’t delivering a great user experience. Owners would rather expose what should be shared and has been properly documented.
a
Will make a note to address this feature request!
m
We’ve found that DataHub’s Views are a good way to limit the noise from ungoverned data assets, so you can search and browse curated data assets by default.
How do you get governed assets into the View? In Acryl (aka Managed DataHub), you can use the Metadata Tests automation to auto-tier data assets based on your definition of good. In OSS, you could achieve something similar as long as you have rules or tagging systems that let you mark data assets a certain way.
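As a rough illustration of the rule-based tagging idea for OSS, the sketch below decides which assets count as "governed" and marks them with a tag that a View could then filter on. Everything here is an assumption: the `Asset` class is a stand-in for metadata you would fetch from DataHub, the rule (has a description and at least one owner) is just one possible definition of good, and `urn:li:tag:Governed` is a hypothetical tag. Writing the tag back to DataHub is left out.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical tag that a "curated assets" View would filter on.
GOVERNED_TAG = "urn:li:tag:Governed"


@dataclass
class Asset:
    """Stand-in for the metadata of one data asset."""
    urn: str
    description: str = ""
    owners: List[str] = field(default_factory=list)
    tags: Set[str] = field(default_factory=set)


def is_governed(asset: Asset) -> bool:
    """Example rule: documented and owned. Swap in your own definition of good."""
    return bool(asset.description.strip()) and bool(asset.owners)


def apply_governed_tag(assets: List[Asset]) -> List[str]:
    """Add or remove the governed tag per asset; return the URNs that carry it."""
    for asset in assets:
        if is_governed(asset):
            asset.tags.add(GOVERNED_TAG)
        else:
            asset.tags.discard(GOVERNED_TAG)
    return [a.urn for a in assets if GOVERNED_TAG in a.tags]


assets = [
    Asset(
        urn="urn:li:dataset:(urn:li:dataPlatform:postgres,analytics.public.orders,PROD)",
        description="One row per customer order.",
        owners=["urn:li:corpuser:jdoe"],
    ),
    Asset(urn="urn:li:dataset:(urn:li:dataPlatform:postgres,analytics.staging.tmp_orders,PROD)"),
]

governed = apply_governed_tag(assets)
print(governed)
```

Run on a schedule, a rule like this keeps the tag in sync as documentation and ownership change, so undocumented staging models stay out of the default View without being hidden from search entirely.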