Hi guys, quick question: we successfully imported Hive (Kerberized) metadata. Now we want to update the datasets inside DataHub with lineage information. How do we do that? Currently we extract the upstream and downstream information from the Hive SQL history.
gray-shoe-75895
03/17/2021, 4:45 PM
Hi Anung! Curious to hear how you're extracting the upstream/downstream information from the hive sql history. In any case, you can use metadata-ingestion's emitters to publish that lineage information to DataHub once you extract it https://datahubproject.io/docs/metadata-ingestion/#using-as-a-library
loud-island-88694
03/17/2021, 4:47 PM
@gray-shoe-75895 is it correct to say they should specifically emit UpstreamLineage and DownstreamLineage events using the emitter?
gray-shoe-75895
03/17/2021, 5:25 PM
Yes that's exactly correct - you can emit a DatasetSnapshot with an UpstreamLineage aspect, and DataHub will match the identifiers/URNs appropriately
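A minimal sketch of what that could look like with the Python REST emitter (the GMS address and table names are illustrative assumptions, and the exact builder helpers may vary across metadata-ingestion versions):
```python
# Rough sketch: emit lineage for one Hive table via the REST emitter.
# The endpoint and table names below are assumptions, not from this thread.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build an MCE whose DatasetSnapshot carries an UpstreamLineage aspect.
lineage_mce = builder.make_lineage_mce(
    [builder.make_dataset_urn("hive", "db.upstream_table")],  # upstream URNs
    builder.make_dataset_urn("hive", "db.downstream_table"),  # downstream URN
)

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS endpoint
emitter.emit_mce(lineage_mce)
```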
calm-lawyer-777
03/18/2021, 6:34 AM
Hi @gray-shoe-75895, we are using the Cloudera distribution, which has an audit log feature (the output is a bunch of Hive query history). We use the Python sqllineage package (on PyPI) to get the source and target tables.
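For context, the sqllineage usage looks roughly like this (the INSERT statement is just an illustrative stand-in for one query from the audit log):
```python
from sqllineage.runner import LineageRunner

# Illustrative query standing in for one entry from the Cloudera audit log.
sql = "INSERT OVERWRITE TABLE db.daily_agg SELECT * FROM db.raw_events"

runner = LineageRunner(sql)
print(runner.source_tables())  # source tables parsed from the query
print(runner.target_tables())  # target tables parsed from the query
```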
gray-shoe-75895
03/18/2021, 5:28 PM
Got it - that's pretty nifty! Using the metadata emitters to emit an update with an UpstreamLineage aspect is the way to go here - happy to give guidance as you build it
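Putting the two halves together, a hedged end-to-end sketch might look like this (the query list, GMS address, and table names are all hypothetical, and you may need to adapt how table names map to your URN convention):
```python
# Hypothetical glue code: parse audited Hive queries with sqllineage, then
# publish one UpstreamLineage update per target table.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter
from sqllineage.runner import LineageRunner

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS address

# Stand-in for queries pulled out of the Cloudera audit log.
audited_queries = [
    "INSERT OVERWRITE TABLE db.daily_agg SELECT * FROM db.raw_events",
]

for sql in audited_queries:
    runner = LineageRunner(sql)
    upstream_urns = [
        # str(table) yields "schema.table"; adjust if your naming differs.
        builder.make_dataset_urn("hive", str(table))
        for table in runner.source_tables()
    ]
    for target in runner.target_tables():
        if upstream_urns:
            mce = builder.make_lineage_mce(
                upstream_urns,
                builder.make_dataset_urn("hive", str(target)),
            )
            emitter.emit_mce(mce)
```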