# ingestion
dazzling-cat-48477
Hello everyone. I have a pipeline from which I would like to extract the lineage. The pipeline consists of the following components:
- AWS S3 buckets
- AWS Glue jobs (PySpark)
- AWS Redshift

All of this is orchestrated by AWS MWAA (Airflow). So far I have managed to visualize the lineage of S3, Redshift, and the Glue jobs (although the latter was a bit difficult), and now I want to get the lineage of Airflow itself, keeping in mind that the Airflow tasks are all of type AwsGlueJobOperator. Since our Airflow is operated by AWS, we cannot use the backend lineage plugin due to version incompatibility, so I plan to try emitting the lineage with the help of the DatahubEmitterOperator. My questions are:
1. Is it possible to tell the lineage emitter that both its upstream and downstream tasks are AwsGlueJobOperator-type tasks?
2. If this is not possible, could it be done with the Spline-spark-agent to extract the data from the Glue jobs?
orange-night-91387
Hi! The DatahubEmitterOperator takes full MCEs as arguments to emit. If you have your AwsGlueJobOperators set up as DataJobs in DataHub, you could instead construct an MCE that adds a DataJobInputOutput aspect with the corresponding producer/consumer URNs.
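For illustration, here is a minimal sketch of that approach: an MCE carrying a DataJobSnapshot with a DataJobInputOutput aspect, emitted from a DAG via DatahubEmitterOperator. The flow/job/dataset names, the connection id, and the import path (datahub_provider, as shipped with acryl-datahub[airflow] at the time) are assumptions, not code from this thread.

```python
from datetime import datetime

from airflow import DAG
import datahub.emitter.mce_builder as builder
from datahub.metadata.schema_classes import (
    DataJobInputOutputClass,
    DataJobSnapshotClass,
    MetadataChangeEventClass,
)
from datahub_provider.operators.datahub import DatahubEmitterOperator

# URN of the Glue job task whose lineage we are describing (hypothetical names).
glue_job_urn = builder.make_data_job_urn(
    orchestrator="glue", flow_id="my_pipeline", job_id="transform_orders"
)

lineage_mce = MetadataChangeEventClass(
    proposedSnapshot=DataJobSnapshotClass(
        urn=glue_job_urn,
        aspects=[
            DataJobInputOutputClass(
                inputDatasets=[
                    builder.make_dataset_urn("s3", "my-bucket/raw/orders", "PROD")
                ],
                outputDatasets=[
                    builder.make_dataset_urn(
                        "redshift", "analytics.public.orders", "PROD"
                    )
                ],
                # Upstream tasks (e.g. other AwsGlueJobOperator tasks modeled as
                # DataJobs) are linked through inputDatajobs.
                inputDatajobs=[
                    builder.make_data_job_urn("glue", "my_pipeline", "extract_orders")
                ],
            )
        ],
    )
)

with DAG(
    dag_id="emit_glue_lineage",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    # Ships the MCE to DataHub through the configured Airflow connection.
    emit_lineage = DatahubEmitterOperator(
        task_id="emit_lineage",
        datahub_conn_id="datahub_rest_default",
        mces=[lineage_mce],
    )
```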
mammoth-bear-12532
@dazzling-cat-48477: have you taken a look at the datahub-spark-lineage jars?
To answer your original question about custom emission of lineage, @orange-night-91387 is right. Here is sample code that emits [dataset] -> [job] -> [dataset] lineage.
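A minimal sketch of that kind of [dataset] -> [job] -> [dataset] emission, assuming the DataHub Python SDK's REST emitter; the server address and all platform/dataset/job names below are placeholders:

```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DataJobInputOutputClass

# Point this at your DataHub GMS endpoint.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

input_urn = builder.make_dataset_urn("s3", "my-bucket/raw/orders", "PROD")
output_urn = builder.make_dataset_urn("redshift", "analytics.public.orders", "PROD")
job_urn = builder.make_data_job_urn("glue", "my_pipeline", "transform_orders")

# Attaching input/output datasets to the job renders as
# [input dataset] -> [job] -> [output dataset] in the lineage graph.
mcp = MetadataChangeProposalWrapper(
    entityType="dataJob",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=job_urn,
    aspectName="dataJobInputOutput",
    aspect=DataJobInputOutputClass(
        inputDatasets=[input_urn],
        outputDatasets=[output_urn],
    ),
)
emitter.emit(mcp)
```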
dazzling-cat-48477
@mammoth-bear-12532 / @orange-night-91387, thank you both for responding. I had checked the Spark integration but stopped because of the Postgres source limitation:
Only postgres supported for JDBC sources in this initial release
In my case, I also have Redshift as a JDBC source. As for custom emission of lineage, I'll check that approach and get back to you.
b
Hey @dazzling-cat-48477, can you please share with us how you managed to extract the lineage from Redshift?