# ingestion
dazzling-cat-48477
Hello everyone. I have a pipeline from which I would like to extract the lineage. The pipeline consists of the following components:
- AWS S3 buckets
- AWS Glue jobs (PySpark)
- AWS Redshift

All of this is orchestrated by AWS MWAA (Airflow). So far I have managed to visualize the lineage of S3, Redshift, and the Glue jobs (although the latter was a bit difficult), and now I want to get the lineage of Airflow itself, keeping in mind that the Airflow tasks are all of type AwsGlueJobOperator. Since our Airflow is operated by AWS, we cannot use the backend lineage plugin due to version incompatibility, so I plan to try emitting the lineage with the help of the DatahubEmitterOperator. My questions are:
1. Is it possible to tell the lineage emitter that both its upstream and downstream tasks are AwsGlueJobOperator-type tasks?
2. If this is not possible, could it be done with the Spline-spark-agent to extract the data from the Glue jobs?
orange-night-91387
Hi! The DatahubEmitterOperator takes full MCEs as arguments to emit. If you have your AwsGlueJobOperators set up as DataJobs in DataHub, you could instead construct an MCE that adds a DataJobInputOutput aspect with the corresponding producer/consumer URNs.
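For illustration, here is a minimal sketch of that approach: an MCE carrying a DataJobSnapshot with a DataJobInputOutput aspect, emitted from a DAG via DatahubEmitterOperator. The flow/job/dataset names, the connection id, and the import path (datahub_provider, as shipped with acryl-datahub[airflow] at the time) are assumptions, not code from this thread.

```python
from datetime import datetime

from airflow import DAG
import datahub.emitter.mce_builder as builder
from datahub.metadata.schema_classes import (
    DataJobInputOutputClass,
    DataJobSnapshotClass,
    MetadataChangeEventClass,
)
from datahub_provider.operators.datahub import DatahubEmitterOperator

# URN of the Glue job task whose lineage we are describing (hypothetical names).
glue_job_urn = builder.make_data_job_urn(
    orchestrator="glue", flow_id="my_pipeline", job_id="transform_orders"
)

lineage_mce = MetadataChangeEventClass(
    proposedSnapshot=DataJobSnapshotClass(
        urn=glue_job_urn,
        aspects=[
            DataJobInputOutputClass(
                inputDatasets=[
                    builder.make_dataset_urn("s3", "my-bucket/raw/orders", "PROD")
                ],
                outputDatasets=[
                    builder.make_dataset_urn(
                        "redshift", "analytics.public.orders", "PROD"
                    )
                ],
                # Upstream tasks (e.g. other AwsGlueJobOperator tasks modeled as
                # DataJobs) are linked through inputDatajobs.
                inputDatajobs=[
                    builder.make_data_job_urn("glue", "my_pipeline", "extract_orders")
                ],
            )
        ],
    )
)

with DAG(
    dag_id="emit_glue_lineage",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    # Ships the MCE to DataHub through the configured Airflow connection.
    emit_lineage = DatahubEmitterOperator(
        task_id="emit_lineage",
        datahub_conn_id="datahub_rest_default",
        mces=[lineage_mce],
    )
```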
mammoth-bear-12532
@dazzling-cat-48477: have you taken a look at the datahub-spark-lineage jars?
To answer your original question about custom emission of lineage, @orange-night-91387 is right. Here is sample code that emits [dataset] -> [job] -> [dataset] lineage.
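A minimal sketch of that kind of [dataset] -> [job] -> [dataset] emission, assuming the DataHub Python SDK's REST emitter; the server address and all platform/dataset/job names below are placeholders:

```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DataJobInputOutputClass

# Point this at your DataHub GMS endpoint.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

input_urn = builder.make_dataset_urn("s3", "my-bucket/raw/orders", "PROD")
output_urn = builder.make_dataset_urn("redshift", "analytics.public.orders", "PROD")
job_urn = builder.make_data_job_urn("glue", "my_pipeline", "transform_orders")

# Attaching input/output datasets to the job renders as
# [input dataset] -> [job] -> [output dataset] in the lineage graph.
mcp = MetadataChangeProposalWrapper(
    entityType="dataJob",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=job_urn,
    aspectName="dataJobInputOutput",
    aspect=DataJobInputOutputClass(
        inputDatasets=[input_urn],
        outputDatasets=[output_urn],
    ),
)
emitter.emit(mcp)
```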
dazzling-cat-48477
@mammoth-bear-12532 / @orange-night-91387, thank you both for responding. I had checked the Spark integration but stopped because of the Postgres source limitation:
Only postgres supported for JDBC sources in this initial release
In my case, I also have Redshift as a JDBC source. As for custom emission of lineage, I'll check that approach and get back to you.
b
Hey @dazzling-cat-48477, can you please share with us how you managed to extract the lineage from Redshift?