# ingestion
c
hello, looking at the `pipeline` feature, what entities do I need to assemble and ingest to get that working? Is there a doc about that feature?
m
@curved-magazine-23582: These are the DataFlow and DataJob entities in the metadata graph
Are you asking specifically about `airflow`?
c
we don't use airflow. I am thinking about cataloging our AWS Glue ETL jobs into DataHub with custom logic through the GMS API. Just not sure what specific entities I need to craft for them to show up in the DataHub UI.
m
Take a look at the sample data here: https://github.com/linkedin/datahub/pull/2396/files
c
I see there are DataJob and DataFlow. There is also DataProcess. I guess my question is whether and how they are related or tied together.
ah, cool, will take a look at the sample data
Does DataProcess matter? 🤔
m
DataFlow == scheduled "DAG / Pipeline", DataJob == "DAG node / stage" within the DataFlow
DataProcess == adhoc job
we haven't put a ton of effort into DataProcess ... we're focusing on DataFlow and DataJob now
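for example, if you're emitting from Python, a rough sketch with the `acryl-datahub` rest emitter could look like the below (the `glue` orchestrator name, the flow/job ids, and the localhost GMS address are just placeholders, not anything specific to your setup):
```python
# rough sketch, not a full Glue connector: emit one DataFlow and one DataJob to GMS
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DataFlowInfoClass,
    DataFlowSnapshotClass,
    DataJobInfoClass,
    DataJobSnapshotClass,
    MetadataChangeEventClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # your GMS endpoint

# the DataFlow == the Glue ETL "pipeline"
flow_urn = make_data_flow_urn("glue", "orders_etl", "PROD")  # placeholder ids
flow_mce = MetadataChangeEventClass(
    proposedSnapshot=DataFlowSnapshotClass(
        urn=flow_urn,
        aspects=[DataFlowInfoClass(name="orders_etl")],
    )
)

# the DataJob == one stage within that flow; its urn embeds the flow urn
job_urn = make_data_job_urn("glue", "orders_etl", "transform_orders", "PROD")
job_mce = MetadataChangeEventClass(
    proposedSnapshot=DataJobSnapshotClass(
        urn=job_urn,
        aspects=[DataJobInfoClass(name="transform_orders", type="GLUE")],
    )
)

emitter.emit_mce(flow_mce)
emitter.emit_mce(job_mce)
```
the emitter is just a thin wrapper over the GMS ingest call, so you could also craft the equivalent MetadataChangeEvent JSON and POST it yourself if you'd rather not pull in the Python library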
c
awesome, thanks for the info!
are DataJobs somehow referenced by DataFlow? I can't find that tie myself. 😞
m
DataJobs describe their relationship to DataFlow. @curved-magazine-23582
c
how does DataJob do that? oh, by ExternalReference? 🤔
m
@curved-magazine-23582: the relationship is included through the urn
e.g.
"urn:li:dataJob:(urn:li:dataFlow:(airflow,dag_abc,PROD),task_123)"
c
ah cool thanks. so dataJob has to be within a dataFlow, I guess
m
correct