# advice-metadata-modeling
Hi folks, as we're integrating some of our metadata sources, we had some questions that we thought we'd run by you. The use cases that folks within our company (Stripe) wanted to start off with were centered around dataset landing times / timeliness, so we decided to start by publishing events as part of some Airflow completion hooks we have in place. Essentially our plan is to infer the relevant datasets (S3, Trino, Iceberg) based on our custom Airflow task extension code and write out metadata change proposal events that describe the Airflow task / DAG (DataJob, DataFlow) as well as the input and output dataset entities. The part we're trying to ensure is future-proof is that when we add additional connectors later (e.g. the Iceberg connector to pull schema- and stats-related metadata for an Iceberg table / dataset), the metadata they emit matches up to the datasets we constructed and emitted as part of our Airflow hooks. Some details in 🧵
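To make that concrete, here's roughly what we have in mind for the hook emission (a minimal sketch using the DataHub Python emitter; the DAG/task names, dataset paths, and GMS address below are placeholders, not our real setup):

```python
# Hypothetical sketch of the Airflow completion-hook emission described above.
# All identifiers (DAG/task ids, dataset names, GMS address) are placeholders.
from datahub.emitter.mce_builder import (
    make_data_flow_urn,
    make_data_job_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass, DataJobInputOutputClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# DataFlow = the Airflow DAG, DataJob = the task within it.
flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="daily_ingest", cluster="prod")
job_urn = make_data_job_urn(
    orchestrator="airflow", flow_id="daily_ingest", job_id="build_foo_bar", cluster="prod"
)

# Input/output datasets inferred from our custom task extension code.
input_urn = make_dataset_urn(platform="s3", name="raw/foo_events", env="PROD")
output_urn = make_dataset_urn(platform="iceberg", name="foo.bar", env="PROD")

# One MCP for the DAG, one for the task's input/output lineage.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=flow_urn,
        aspect=DataFlowInfoClass(name="daily_ingest"),
    )
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=job_urn,
        aspect=DataJobInputOutputClass(
            inputDatasets=[input_urn],
            outputDatasets=[output_urn],
        ),
    )
)
```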
Our understanding is that if we choose the dataset URN correctly, we should be able to emit different aspects from different emit calls (so one emit call could say the Iceberg dataset `urn:li:dataPlatform:iceberg,foo.bar,PROD` was the output of a given DataJob, and other emit calls could push the Iceberg metadata for table foo.bar).
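For example (same placeholder URN as above, assuming the standard Python emitter), a later Iceberg connector run could add another aspect to that same dataset entity with a separate emit call:

```python
# Hypothetical follow-up emit from a separate connector run: as long as the
# dataset URN is constructed identically, this aspect lands on the same entity
# as the lineage emitted from the Airflow hook.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = make_dataset_urn(platform="iceberg", name="foo.bar", env="PROD")

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(
            description="Iceberg table foo.bar (placeholder description)",
            customProperties={"table_format": "iceberg"},
        ),
    )
)
```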
A couple of questions around this:
1. Is it possible in DataHub to link a couple of datasets (like a Unix symbolic link)? Our thinking here was that, in case we register datasets from our multiple connectors with potentially differing URNs, we could link them together to say datasets X and Y are actually the same.
2. How have folks dealt with metadata backfills? In case we emit datasets with incorrect metadata / URNs and need to go back and clean things up, what's the best way for us to do so?
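For 1, the approach we were wondering about (not sure if it's the intended usage, so treat this as a hedged sketch) is DataHub's `siblings` aspect; for 2, re-emitting corrected aspects or soft-deleting a bad URN via the `status` aspect. All URNs below are placeholders:

```python
# Hypothetical sketches for the two questions above; URNs are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import SiblingsClass, StatusClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

urn_x = make_dataset_urn(platform="s3", name="bucket/foo_bar", env="PROD")
urn_y = make_dataset_urn(platform="iceberg", name="foo.bar", env="PROD")

# 1) "Symlink"-style linking: mark X and Y as siblings of each other,
#    with one side flagged as the primary entity.
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=urn_x,
    aspect=SiblingsClass(siblings=[urn_y], primary=False),
))
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=urn_y,
    aspect=SiblingsClass(siblings=[urn_x], primary=True),
))

# 2) Cleanup of a dataset emitted under a wrong URN: soft-delete it by
#    setting the status aspect, then re-emit under the corrected URN.
bad_urn = make_dataset_urn(platform="iceberg", name="foo.bar_typo", env="PROD")
emitter.emit(MetadataChangeProposalWrapper(
    entityUrn=bad_urn,
    aspect=StatusClass(removed=True),
))
```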
Our worry with 1 is primarily centered around metadata for our datasets that resides in a source-controlled dataset registry (this tracks things like ACLs, retention, etc.). We'll need to build a custom metadata ingestion source for this (similar in a way to https://github.com/datahub-project/datahub/tree/master/metadata-integration/java/datahub-protobuf-example). In most cases we should be able to line these URNs up to be consistent with what we emit in Airflow, but some of our pure S3 datasets might be harder, which is why a way to link them together later might help a bit.
cc @numerous-byte-87938 as we were chatting about this last week
Hey @bitter-lizard-32293, great questions 🙂 I will respond here today for sure.
thank you!