# advice-metadata-modeling
r
I’m having a hard time understanding DataHub’s entities. Could anyone explain the relationship between DataPlatform, Dataset, DataJob and DataFlow from a lineage point of view, with some Kafka and Spark examples?
Let’s consider a Kafka cluster which has a topic. A Spark streaming job subscribes to the topic and writes the messages from the topic to HDFS. In the scenario above:
- Data platform: Kafka cluster, HDFS
- Dataset: Kafka topic, files on HDFS
- DataFlow: Spark cluster
- DataJob: Spark streaming job
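If that mapping is right, the four entities would be identified by URNs along these lines. This is a minimal sketch using DataHub’s documented URN shapes, built with plain string formatting rather than the SDK; the topic name, HDFS path, flow id and job id are made-up placeholders:

```python
# Sketch: DataHub URNs for the Kafka -> Spark -> HDFS scenario.
# All entity names here (my_topic, /data/events, ...) are placeholders.

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # A Dataset URN embeds its platform URN, the dataset name, and the env.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def make_data_flow_urn(orchestrator: str, flow_id: str, cluster: str = "prod") -> str:
    # A DataFlow URN names the orchestrator/platform, a flow id, and a cluster.
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"

def make_data_job_urn(flow_urn: str, job_id: str) -> str:
    # A DataJob URN nests the parent DataFlow URN.
    return f"urn:li:dataJob:({flow_urn},{job_id})"

# Data platforms: Kafka and HDFS
kafka_platform = "urn:li:dataPlatform:kafka"
hdfs_platform = "urn:li:dataPlatform:hdfs"

# Datasets: the Kafka topic and the files on HDFS
topic_urn = make_dataset_urn("kafka", "my_topic")
hdfs_urn = make_dataset_urn("hdfs", "/data/events")

# DataFlow: the Spark application; DataJob: the streaming job inside it
flow_urn = make_data_flow_urn("spark", "streaming_flow")
job_urn = make_data_job_urn(flow_urn, "topic_to_hdfs")

print(topic_urn)  # urn:li:dataset:(urn:li:dataPlatform:kafka,my_topic,PROD)
print(job_urn)    # urn:li:dataJob:(urn:li:dataFlow:(spark,streaming_flow,prod),topic_to_hdfs)
```

In the real SDK these helpers exist as `make_dataset_urn`, `make_data_flow_urn` and `make_data_job_urn` in `datahub.emitter.mce_builder`.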
To populate lineage in this example, the `dataJobInputOutput` aspect has to be set on the Spark streaming job (the DataJob).
Can you give me some advice if I’ve got any of this wrong, please?
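For reference, the `dataJobInputOutput` aspect for that job would look roughly like this. It’s sketched here as a plain dict mirroring the aspect’s `inputDatasets`/`outputDatasets` fields rather than the actual SDK classes, and the URNs are placeholders; in practice you would build it with `DataJobInputOutputClass` and send it through a DataHub emitter:

```python
# Sketch: the dataJobInputOutput aspect attached to the Spark streaming DataJob.
# Shown as a plain dict mirroring the aspect schema.

job_urn = "urn:li:dataJob:(urn:li:dataFlow:(spark,streaming_flow,prod),topic_to_hdfs)"

data_job_input_output = {
    # Upstream: the Kafka topic the job subscribes to
    "inputDatasets": [
        "urn:li:dataset:(urn:li:dataPlatform:kafka,my_topic,PROD)",
    ],
    # Downstream: the files the job writes on HDFS
    "outputDatasets": [
        "urn:li:dataset:(urn:li:dataPlatform:hdfs,/data/events,PROD)",
    ],
}

# From this aspect DataHub derives dataset-level lineage
# (Kafka topic -> HDFS files), with the DataJob sitting in between.
```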
m
What I understand is that the DataFlow represents a pipeline, and a DataJob is an individual task/step in that flow.
h
Still wondering how to relate the DataPlatform to the individual DataFlows. The Airflow integration does it, but I can’t figure out how to instantiate or create a new DataPlatform entity during the creation of new DataFlows. Anyone?
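As far as I can tell, you don’t create a DataPlatform entity alongside each DataFlow: the first component of the DataFlow URN (the orchestrator, e.g. `airflow` or `spark`) is itself the platform reference, and the platforms are registered separately (DataHub ships with the standard ones). A minimal sketch of how the platform URN falls out of a DataFlow URN, assuming the documented URN shape:

```python
# Sketch: a DataFlow references its platform via the orchestrator component
# of its URN, not via an embedded DataPlatform entity.

def platform_urn_for_flow(flow_urn: str) -> str:
    # Flow URN shape: urn:li:dataFlow:(<orchestrator>,<flow_id>,<cluster>)
    prefix = "urn:li:dataFlow:("
    inner = flow_urn[len(prefix):-1]        # "<orchestrator>,<flow_id>,<cluster>"
    orchestrator = inner.split(",")[0]      # e.g. "airflow"
    return f"urn:li:dataPlatform:{orchestrator}"

flow_urn = "urn:li:dataFlow:(airflow,my_dag,prod)"
print(platform_urn_for_flow(flow_urn))  # urn:li:dataPlatform:airflow
```

If you need a platform DataHub doesn’t know about, my understanding is that it has to be registered once on its own (via a `dataPlatformInfo` aspect), after which any DataFlow naming it in the orchestrator slot links to it.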