We created an abstraction layer to manage our data platform DataHub #advice-data-governance


gorgeous-dinner-4055

01/13/2022, 1:54 AM

We created an abstraction layer to manage our data platform, described here. All of our configuration is managed through Git and goes through a code review process before anything gets checked in. Because of this code review process, we are able to be more proactive about some of the questions that you're raising, @acceptable-potato-35922:

We want our Data Producers to be responsible for the PUSH of metadata to DataHub and our team to be responsible for the platform itself - not the individual datasets.

Since our table configuration goes through Git, we are able to make sure owners are assigned to tables and that metadata about each table is set. We can then push that data to DataHub from Git once it's checked in. One pain point for our data users (as opposed to data owners) is: "It's such a pain to add new metadata! Can't it be easier???" So our current thought is to govern the base set of info that needs to be set for a table to get created, and to enable users of that data to easily add new metadata about the tables directly in DataHub. That being said, governance has been one of our biggest problems, and we are hoping to address a lot of those issues by extending DataHub for our use cases.
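To make the "base set of info" idea concrete, here is a minimal sketch of a check that could run during code review, before a table config is merged. The field names (`name`, `owners`, `description`) and the function `validate_table_config` are hypothetical; the thread does not say which fields are actually required or how the check is implemented.

```python
# Hypothetical baseline; the real set of required fields is not given in the thread.
REQUIRED_FIELDS = ("name", "owners", "description")


def validate_table_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    for key in REQUIRED_FIELDS:
        # Treat a missing key, an empty string, and an empty list the same way.
        if not config.get(key):
            problems.append(f"missing required field: {key}")
    return problems


if __name__ == "__main__":
    draft = {"name": "sales.orders", "owners": []}
    for problem in validate_table_config(draft):
        print(problem)
```

A check like this would only guard the governed baseline; anything users add later directly in DataHub would sit outside it.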

acceptable-potato-35922

01/13/2022, 3:19 PM

Thanks for sharing @gorgeous-dinner-4055! In your situation, who’s responsible for the commit in git, and who’s responsible for the push from Git to DataHub? Is it the Data Producer itself, or the team managing the DataHub implementation?

acceptable-potato-35922

01/13/2022, 3:22 PM

Also, I completely agree on setting a minimum number of metadata points in order to create a table. We are gearing up to do the same. In a past life I’ve created a Registration layer that makes each data producer declare their table and execute their own push of metadata. It creates a bit of an initial headache, but it’s a one-time setup and you get all the metadata you need right up front.

gorgeous-dinner-4055

01/16/2022, 6:22 PM

The Git -> DataHub push is automated, since we (the platform team) provide the standard schema used to define new tables. Data owners are responsible for creating the push to Git. We have custom tooling similar to dbt, plus a UI layer that assists users in creating new tables, which has been a huge success.

"a Registration layer that makes each data producer declare their table and execute their own push of metadata."

One of the nice things with standard schemas is that you can automate all the downstream interactions 🙂
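As a rough illustration of why a standard schema makes the downstream step easy to automate, the sketch below turns a merged table config into a metadata payload. Everything here is an assumption for illustration: the config keys, the `build_payload` helper, and the default `PROD` environment are hypothetical, and the actual push to DataHub is left out. Only the URN string follows the dataset URN shape that DataHub uses.

```python
import json


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string in the shape DataHub uses for datasets."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_payload(config: dict) -> dict:
    """Map a (hypothetical) standard table config to a metadata payload."""
    return {
        "urn": dataset_urn(config["platform"], config["name"]),
        "owners": sorted(config["owners"]),
        "description": config["description"],
    }


if __name__ == "__main__":
    config = {
        "platform": "hive",
        "name": "sales.orders",
        "owners": ["team-b", "team-a"],
        "description": "Order facts",
    }
    # A CI job could hand this payload to whatever client performs the push.
    print(json.dumps(build_payload(config), indent=2))
```

Because every config has the same shape, the same function works for every table, so no per-dataset push code is needed.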

acceptable-potato-35922

01/18/2022, 7:04 PM

Yeah, that’s a good point. Thanks for sharing @gorgeous-dinner-4055. We are still in the infancy of our journey, but will certainly take a page from your recommendations!
