Hi guys, quick question: we successfully imported Hive (Kerberized) metadata. Now we want to update the datasets inside DataHub with lineage information. How do we do that? Currently we extract the upstream and downstream information from the Hive SQL history.
gray-shoe-75895
03/17/2021, 4:45 PM
Hi Anung! Curious to hear how you're extracting the upstream/downstream information from the hive sql history. In any case, you can use metadata-ingestion's emitters to publish that lineage information to DataHub once you extract it https://datahubproject.io/docs/metadata-ingestion/#using-as-a-library
loud-island-88694
03/17/2021, 4:47 PM
@gray-shoe-75895 is it correct to say they should specifically emit UpstreamLineage and DownstreamLineage events using the emitter?
gray-shoe-75895
03/17/2021, 5:25 PM
Yes that's exactly correct - you can emit a DatasetSnapshot with an UpstreamLineage aspect, and DataHub will match the identifiers/URNs appropriately
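A minimal sketch of what that could look like with the Python REST emitter (the GMS address and table names are illustrative assumptions, and the exact builder helpers may vary across metadata-ingestion versions):
```python
# Rough sketch: emit lineage for one Hive table via the REST emitter.
# The endpoint and table names below are assumptions, not from this thread.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build an MCE whose DatasetSnapshot carries an UpstreamLineage aspect.
lineage_mce = builder.make_lineage_mce(
    [builder.make_dataset_urn("hive", "db.upstream_table")],  # upstream URNs
    builder.make_dataset_urn("hive", "db.downstream_table"),  # downstream URN
)

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS endpoint
emitter.emit_mce(lineage_mce)
```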
calm-lawyer-777
03/18/2021, 6:34 AM
Hi @gray-shoe-75895, we are using the Cloudera distribution, which has an audit log feature (the output is a bunch of Hive query history). We use the Python sqllineage package (on PyPI) to get the source and target tables.
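For context, the sqllineage usage looks roughly like this (the INSERT statement is just an illustrative stand-in for one query from the audit log):
```python
from sqllineage.runner import LineageRunner

# Illustrative query standing in for one entry from the Cloudera audit log.
sql = "INSERT OVERWRITE TABLE db.daily_agg SELECT * FROM db.raw_events"

runner = LineageRunner(sql)
print(runner.source_tables())  # source tables parsed from the query
print(runner.target_tables())  # target tables parsed from the query
```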
gray-shoe-75895
03/18/2021, 5:28 PM
Got it - that's pretty nifty! Using the metadata emitters to emit an update with an UpstreamLineage aspect is the way to go here - happy to give guidance as you build it
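Putting the two halves together, a hedged end-to-end sketch might look like this (the query list, GMS address, and table names are all hypothetical, and you may need to adapt how table names map to your URN convention):
```python
# Hypothetical glue code: parse audited Hive queries with sqllineage, then
# publish one UpstreamLineage update per target table.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter
from sqllineage.runner import LineageRunner

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS address

# Stand-in for queries pulled out of the Cloudera audit log.
audited_queries = [
    "INSERT OVERWRITE TABLE db.daily_agg SELECT * FROM db.raw_events",
]

for sql in audited_queries:
    runner = LineageRunner(sql)
    upstream_urns = [
        # str(table) yields "schema.table"; adjust if your naming differs.
        builder.make_dataset_urn("hive", str(table))
        for table in runner.source_tables()
    ]
    for target in runner.target_tables():
        if upstream_urns:
            mce = builder.make_lineage_mce(
                upstream_urns,
                builder.make_dataset_urn("hive", str(target)),
            )
            emitter.emit_mce(mce)
```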