# advice-metadata-modeling
a
Hello! We’re looking at how DataHub can support tracking detailed operational metadata. Essentially, we need to be able to get lineage between each run of a data pipeline / task and the datasets it produces (for the simple case where each task produces a new immutable dataset on each run). Does DataHub support this out of the box? Looking at the integrations with Apache Spark or Airflow, the mapping seems to be:
```
Spark App / Airflow DAG run -> DataPipeline
```
and
```
Spark Job / Airflow Task run -> DataJob
```
but since the same URN is emitted each time, what you get is an update of the same DataPipeline or DataJob entity for each run. This means there is no way of seeing which exact run (update) generated which Dataset, because all datasets created by a task point to the same DataJob entity. Are we missing something? It seems like the new Timeline API released as part of v0.8.28 would address this if it were implemented for DataPipelines and DataJobs as well. Is that assumption correct? Any idea how long until that gets rolled out?
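To make the issue concrete, here is a minimal sketch of how those URNs get built, assuming the URN helpers in the acryl-datahub Python SDK (`datahub.emitter.mce_builder`; exact signatures may differ across versions). There is no run component in the URN, so every run writes to the same entity:

```python
# Minimal sketch of why every run collapses onto the same entity. Assumes the
# URN helpers in datahub.emitter.mce_builder (acryl-datahub); signatures may
# differ across versions.
from datahub.emitter.mce_builder import make_data_flow_urn, make_data_job_urn

# The URNs are built only from orchestrator / DAG id / task id -- no run id.
flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="my_dag", cluster="prod")
job_urn = make_data_job_urn(orchestrator="airflow", flow_id="my_dag", job_id="my_task", cluster="prod")

print(flow_urn)  # urn:li:dataFlow:(airflow,my_dag,prod)
print(job_urn)   # urn:li:dataJob:(urn:li:dataFlow:(airflow,my_dag,prod),my_task)

# Run 1 and run 2 of my_task both emit against job_urn, so lineage edges from
# every run's output dataset end up on the same DataJob entity.
```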
l
@aloof-arm-38044 this is a very timely question. @dazzling-judge-80093 and @big-carpet-38439 are looking into solving this exact problem
at least for Airflow initially
a
Great to hear @loud-island-88694! Where can we best track this work?
Also, two follow-up questions:
1. Does this mean that after this work is done, all the pieces will be in place for integrating DataHub with OpenLineage? Any plans to do that yet?
2. Above we spoke about lineage from Task runs to *immutable* Datasets. But DataHub also supports tracking changes in *mutable* Datasets, such as schema changes (via the latest Timeline API). Is there a plan to provide lineage from a Task run to each version of a mutable dataset as it evolves over time?
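For reference, a hedged sketch of querying the Timeline API for schema-change transactions on a dataset; the endpoint path, query parameters, and response shape here are assumptions based on the GMS OpenAPI surface and may differ across DataHub versions:

```python
# Hedged sketch of a Timeline API query for TECHNICAL_SCHEMA changes.
# Endpoint path and parameters are assumptions and may differ by version.
import urllib.parse

import requests

GMS = "http://localhost:8080"  # assumed local GMS address
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)"

resp = requests.get(
    f"{GMS}/openapi/timeline/v1/{urllib.parse.quote(dataset_urn, safe='')}",
    params={"categories": "TECHNICAL_SCHEMA", "start": 0},
)
resp.raise_for_status()

# Each entry is a change transaction describing a semantically versioned change
# to the dataset's schema; field names depend on the DataHub version.
for transaction in resp.json():
    print(transaction)
```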
l
For 1) The work we're doing has a broader goal of creating a consistent versioned graph, so that each edge correctly represents the state of the world at that point. We also intend to support partitioned datasets and associate runs with specific partitions, etc. There may be some overlap with what other systems like OpenLineage are doing for pipelines alone, but the goal is broader. At this stage, there are no active plans around OpenLineage.
For 2), it is actually a tricky problem, since we observe changes to datasets and pipelines at different points in time. So even if a dataset has mutated, we may not yet have observed it, while we may have already observed the pipeline run that consumes the newer schema. Maybe I'm missing something fundamental and @mammoth-bear-12532 has thought about this problem more deeply
the airflow emitter probably has to emit some kind of correlation ID so that we can fix up the edges after the fact
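A rough sketch of what that could look like, purely as an illustration: stash the run id on the DataJob as a custom property so edges can be reconciled to a specific run later. The property key "airflow_run_id" is hypothetical rather than an agreed convention, and the acryl-datahub class names / signatures may differ across versions:

```python
# Illustrative correlation-ID sketch: tag the DataJob with the Airflow run_id
# as a custom property so edges can later be reconciled to a specific run.
# "airflow_run_id" is a made-up key; SDK signatures may vary by version.
from datahub.emitter.mce_builder import make_data_job_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataJobInfoClass

emitter = DatahubRestEmitter("http://localhost:8080")
job_urn = make_data_job_urn(orchestrator="airflow", flow_id="my_dag", job_id="my_task")

aspect = DataJobInfoClass(
    name="my_task",
    type="COMMAND",
    customProperties={"airflow_run_id": "scheduled__2022-03-01T00:00:00+00:00"},
)
emitter.emit_mcp(MetadataChangeProposalWrapper(entityUrn=job_urn, aspect=aspect))
```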
m
Hey @aloof-arm-38044, just a quick note here that we're thinking about how to create "happened-before" relationships correctly when connecting push-based metadata from emitters with pull-based metadata from crawlers... would be cool to get your thoughts on it as well. My current thinking is that some sources will provide the actual timestamp of the schema change from the source catalog directly, so we can use that to correlate across job runs and technical schema changes. In cases where this is not present, we would have to provide error bounds based on the ingestion frequency.
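To illustrate the error-bounds idea, a small self-contained sketch (not DataHub code, just the matching logic): pick the latest schema version whose true change time could plausibly precede a given run, treating the ingestion interval as the uncertainty window. All names here are made up for illustration:

```python
# Illustrative sketch of the "error bounds" idea: pick the schema version a job
# run most plausibly consumed, given that pull-based crawling only observes a
# change some time within the ingestion interval.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaVersion:
    version: str
    observed_at_ms: int  # when the crawler saw the change, not when it happened


def schema_for_run(
    run_start_ms: int,
    versions: List[SchemaVersion],
    ingestion_interval_ms: int,
) -> Optional[SchemaVersion]:
    """Latest version whose true change time could precede the run.

    A change observed at t may have happened anywhere in
    [t - ingestion_interval_ms, t], so that whole window counts as uncertain.
    If the source catalog provides the exact change timestamp, the interval
    collapses to zero and the match becomes exact.
    """
    candidates = [
        v for v in versions if v.observed_at_ms - ingestion_interval_ms <= run_start_ms
    ]
    return max(candidates, key=lambda v: v.observed_at_ms, default=None)
```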