Hi, I found all ingested glue jobs have dataFlow urn only, doesn't have job urn | DataHub #ingestion

few-sugar-84064

09/28/2022, 2:36 AM

Hi, I found that all of my ingested Glue jobs have only a dataFlow URN and no dataJob URN, so I can't see lineage, even for a job with an auto-generated script. Does anyone know how to ingest Glue jobs as DataJobs rather than DataFlows? Below is the YAML recipe I used to ingest the jobs. Thanks.

source:
  type: glue
  config:
    aws_region: "ap-northeast-2"
    extract_transforms: True
    catalog_id: "catalog_id"

sink:
  type: "datahub-rest"
  config:
    server: "gms server address"

gray-shoe-75895

09/28/2022, 7:08 PM

The Glue source currently parses your auto-generated scripts, and each script maps to a DataHub dataFlow with multiple dataJobs nested inside.

gray-shoe-75895

09/28/2022, 7:10 PM

Could you provide some more detail on how your setup looks and what you’re looking to see in DataHub?

few-sugar-84064

09/29/2022, 12:57 AM

@gray-shoe-75895 Currently, all of my Glue jobs run ETL from Redshift tables into a Redshift table using user-defined queries. Because of that, DataHub can't detect the individual jobs inside the scripts, so I was trying to find a way to ingest a whole Glue job as a DataJob rather than a DataFlow. There seems to be no way to do that, so I manually created a job with the same name as the flow and ingested the lineage myself. If there's a better idea, please advise. Thanks for your response.

gray-shoe-75895

09/29/2022, 11:17 PM

To make sure I understand, each of your glue jobs has a single task, and that task does some ETL work - is that right? I think in that case you’ve got the best workaround already
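For readers following along, here is a minimal sketch of what that workaround might look like. It only builds the URNs and a `dataJobInputOutput` payload as plain Python data; it does not send anything to DataHub. The job name `my_glue_job`, the Redshift table names, the `PROD` environment, and the helper function names are all hypothetical, and the exact payload the thread author emitted is not shown in the conversation.

```python
# Sketch only: builds URN strings and a lineage payload for one Glue job.
# All names below (job, tables, env, helper functions) are illustrative assumptions.

def make_flow_urn(orchestrator: str, flow_id: str, cluster: str = "PROD") -> str:
    """URN of the dataFlow that the Glue source creates for a Glue job."""
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"


def make_job_urn(flow_urn: str, job_id: str) -> str:
    """URN of a dataJob; it embeds the URN of its parent dataFlow."""
    return f"urn:li:dataJob:({flow_urn},{job_id})"


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """URN of a dataset on a given platform."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_job_lineage(glue_job_name: str, inputs: list, outputs: list) -> dict:
    """Describe one dataJob, named after its flow, with input/output datasets."""
    flow_urn = make_flow_urn("glue", glue_job_name)
    # Workaround from the thread: the job reuses the flow's name.
    job_urn = make_job_urn(flow_urn, glue_job_name)
    return {
        "entityUrn": job_urn,
        "aspectName": "dataJobInputOutput",
        "aspect": {
            "inputDatasets": [make_dataset_urn("redshift", t) for t in inputs],
            "outputDatasets": [make_dataset_urn("redshift", t) for t in outputs],
        },
    }


lineage = build_job_lineage(
    "my_glue_job",                                   # hypothetical Glue job name
    inputs=["dev.public.orders", "dev.public.customers"],  # hypothetical sources
    outputs=["dev.public.order_summary"],            # hypothetical target
)
print(lineage["entityUrn"])
```

To actually register the job, a payload like this would still need to be emitted to the GMS server, for example with the DataHub Python SDK's REST emitter; that step is left out here because it depends on the SDK version and server setup.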
