# ingestion
f
Hi, is there anyone who ingests lineage of a Glue job (Redshift table to Redshift table ETL job) manually? If yes, a sample code would be really helpful for me. Thanks. I've tried the following: • Glue Annotation - got a parsing error
Error parsing DAG for Glue job. The script <s3://steadio-glue-info/scripts/test-datahub-lineage.py> cannot be processed by Glue (this usually occurs when it has been user-modified): An error occurred (InvalidInputException) when calling the GetDataflowGraph operation: line 11:87 no viable alternative at input \'## @type: DataSource\\n## @args: [catalog_connection = "redshiftconnection", connection_options = {"database" =\'']}
• Dataset job code - I have no idea what I need to put for the job ID and flow ID
h
Hi @few-sugar-84064 - have you already ingested the Glue job in question into DataHub some way, and do you only intend to emit its lineage manually?
f
Hi @hundreds-photographer-13496, yes, I've already ingested all my Glue jobs via the CLI, but the lineage of each job is not shown because the scripts weren't generated automatically. So I'm considering ingesting the lineage manually.
h
In that case, you need not use
builder.make_data_job_urn
to construct the URN for the data job. You can directly use the URN of the Glue dataJob already ingested in DataHub, which is the entity you need to set lineage for.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.datajob import DataJobInputOutputClass
from datahub.metadata.schema_classes import ChangeTypeClass

# Declare the input and output datasets of the data job.
datajob_input_output = DataJobInputOutputClass(
    inputDatasets=["<placeholder for input redshift table urn>"],
    outputDatasets=["<placeholder for output redshift table urn>"],
)

datajob_input_output_mcp = MetadataChangeProposalWrapper(
    entityType="dataJob",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn="<placeholder for glue job urn>",
    aspectName="dataJobInputOutput",
    aspect=datajob_input_output,
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mcp(datajob_input_output_mcp)
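To fill the dataset URN placeholders above, here is a minimal sketch of the Redshift dataset URN format using plain string formatting (the table names are hypothetical; DataHub's `datahub.emitter.mce_builder.make_dataset_urn` helper produces the same shape):

```python
def make_redshift_dataset_urn(table: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN for a Redshift table.

    `table` is the fully qualified name, e.g. "db.schema.table".
    The names used below are hypothetical examples, not real tables.
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:redshift,{table},{env})"


input_urn = make_redshift_dataset_urn("dev.public.source_table")
output_urn = make_redshift_dataset_urn("dev.public.target_table")
print(input_urn)
# urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.source_table,PROD)
```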
f
@hundreds-photographer-13496 Thanks for the code. I found that my Glue jobs only have a flow URN, so I can't ingest metadata with it. Could you advise how to change it to a job URN? I tried to make the job URN manually, like
urn:li:dataJob:(urn:li:dataFlow:(glue,flow name,PROD),flow name)
and ran the code. It ran successfully, but the Glue flow still doesn't show lineage in the frontend view.
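The URN structure I tried can be sketched with plain string formatting (the names below are hypothetical placeholders; note that the second component of a dataJob URN is the job id, which is not necessarily the same as the flow name):

```python
def make_glue_data_job_urn(flow_name: str, job_name: str, env: str = "PROD") -> str:
    """Build a DataHub dataJob URN nested inside its dataFlow URN.

    The flow and job names here are hypothetical placeholders; the job id
    (second component) should match the job entity ingested in DataHub,
    not simply repeat the flow name.
    """
    flow_urn = f"urn:li:dataFlow:(glue,{flow_name},{env})"
    return f"urn:li:dataJob:({flow_urn},{job_name})"


print(make_glue_data_job_urn("my-glue-flow", "my-glue-job"))
# urn:li:dataJob:(urn:li:dataFlow:(glue,my-glue-flow,PROD),my-glue-job)
```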