# advice-metadata-modeling
r
I’m having a hard time understanding DataHub’s entities. Could anyone explain the relationship between DataPlatform, Dataset, DataJob and DataFlow from a lineage point of view, with some Kafka and Spark examples?
Let’s consider a Kafka cluster which has a topic. A Spark streaming job subscribes to the topic and writes the messages from the topic to HDFS. In the scenario above:
- Data platform: Kafka cluster, HDFS
- Dataset: Kafka topic, files on HDFS
- DataFlow: Spark cluster
- DataJob: Spark streaming job
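If that mapping is right, the four entities would be identified by URNs along these lines. This is a minimal sketch using DataHub’s documented URN shapes, built with plain string formatting rather than the SDK; the topic name, HDFS path, flow id and job id are made-up placeholders:

```python
# Sketch: DataHub URNs for the Kafka -> Spark -> HDFS scenario.
# All entity names here (my_topic, /data/events, ...) are placeholders.

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # A Dataset URN embeds its platform URN, the dataset name, and the env.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def make_data_flow_urn(orchestrator: str, flow_id: str, cluster: str = "prod") -> str:
    # A DataFlow URN names the orchestrator/platform, a flow id, and a cluster.
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{cluster})"

def make_data_job_urn(flow_urn: str, job_id: str) -> str:
    # A DataJob URN nests the parent DataFlow URN.
    return f"urn:li:dataJob:({flow_urn},{job_id})"

# Data platforms: Kafka and HDFS
kafka_platform = "urn:li:dataPlatform:kafka"
hdfs_platform = "urn:li:dataPlatform:hdfs"

# Datasets: the Kafka topic and the files on HDFS
topic_urn = make_dataset_urn("kafka", "my_topic")
hdfs_urn = make_dataset_urn("hdfs", "/data/events")

# DataFlow: the Spark application; DataJob: the streaming job inside it
flow_urn = make_data_flow_urn("spark", "streaming_flow")
job_urn = make_data_job_urn(flow_urn, "topic_to_hdfs")

print(topic_urn)  # urn:li:dataset:(urn:li:dataPlatform:kafka,my_topic,PROD)
print(job_urn)    # urn:li:dataJob:(urn:li:dataFlow:(spark,streaming_flow,prod),topic_to_hdfs)
```

In the real SDK these helpers exist as `make_dataset_urn`, `make_data_flow_urn` and `make_data_job_urn` in `datahub.emitter.mce_builder`.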
To populate lineage in this example, the `dataJobInputOutput` aspect has to be set on the Spark streaming job (the DataJob).
Can you give me some advice if I’ve got any of this wrong, please?
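For reference, the `dataJobInputOutput` aspect for that job would look roughly like this. It’s sketched here as a plain dict mirroring the aspect’s `inputDatasets`/`outputDatasets` fields rather than the actual SDK classes, and the URNs are placeholders; in practice you would build it with `DataJobInputOutputClass` and send it through a DataHub emitter:

```python
# Sketch: the dataJobInputOutput aspect attached to the Spark streaming DataJob.
# Shown as a plain dict mirroring the aspect schema.

job_urn = "urn:li:dataJob:(urn:li:dataFlow:(spark,streaming_flow,prod),topic_to_hdfs)"

data_job_input_output = {
    # Upstream: the Kafka topic the job subscribes to
    "inputDatasets": [
        "urn:li:dataset:(urn:li:dataPlatform:kafka,my_topic,PROD)",
    ],
    # Downstream: the files the job writes on HDFS
    "outputDatasets": [
        "urn:li:dataset:(urn:li:dataPlatform:hdfs,/data/events,PROD)",
    ],
}

# From this aspect DataHub derives dataset-level lineage
# (Kafka topic -> HDFS files), with the DataJob sitting in between.
```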
m
What I understand is that the DataFlow represents a pipeline, and a DataJob is an individual task/step in that flow.
h
Still wondering how to relate the DataPlatform to the individual DataFlows. The Airflow integration does it, but I can’t figure out how to instantiate or create a new DataPlatform entity during the creation of new DataFlows. Anyone?
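As far as I can tell, you don’t create a DataPlatform entity alongside each DataFlow: the first component of the DataFlow URN (the orchestrator, e.g. `airflow` or `spark`) is itself the platform reference, and the platforms are registered separately (DataHub ships with the standard ones). A minimal sketch of how the platform URN falls out of a DataFlow URN, assuming the documented URN shape:

```python
# Sketch: a DataFlow references its platform via the orchestrator component
# of its URN, not via an embedded DataPlatform entity.

def platform_urn_for_flow(flow_urn: str) -> str:
    # Flow URN shape: urn:li:dataFlow:(<orchestrator>,<flow_id>,<cluster>)
    prefix = "urn:li:dataFlow:("
    inner = flow_urn[len(prefix):-1]        # "<orchestrator>,<flow_id>,<cluster>"
    orchestrator = inner.split(",")[0]      # e.g. "airflow"
    return f"urn:li:dataPlatform:{orchestrator}"

flow_urn = "urn:li:dataFlow:(airflow,my_dag,prod)"
print(platform_urn_for_flow(flow_urn))  # urn:li:dataPlatform:airflow
```

If you need a platform DataHub doesn’t know about, my understanding is that it has to be registered once on its own (via a `dataPlatformInfo` aspect), after which any DataFlow naming it in the orchestrator slot links to it.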