# getting-started
i
Hi folks, does Datahub allow for an Airflow task to have multiple output Datasets? Using the lineage_backend_demo.py produces the lineage graph shown in the screenshot below. If I modify that file to add two additional outlet datasets, the lineage graph remains the same when the context is centered on the run_data_task Task. If I switch the context to one of the new datasets I added (for example tableG), it shows the lineage from the originating dataset, omitting the Task. I'm trying to determine if this is simply a bug, or if Tasks aren't intended to have multiple outlet Datasets. I'm inclined to think it's just a bug, as the outlet datasets value is an array, though it would be good to confirm. Thanks
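For context, here is a minimal sketch of the kind of DAG being described, assuming Airflow 2.x-style `inlets`/`outlets` and the `Dataset` entity helper that the demo file uses; the platform and table names are illustrative only:

```python
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

# Lineage entity helper used by the DataHub lineage backend demo.
from datahub_provider.entities import Dataset

with DAG(
    dag_id="datahub_lineage_backend_demo",
    start_date=days_ago(1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    # One task declaring several inlet datasets and several outlet datasets.
    # The question is whether every entry in `outlets` shows up in the lineage graph.
    run_data_task = BashOperator(
        task_id="run_data_task",
        bash_command="echo 'run your data tooling here'",
        inlets=[
            Dataset("snowflake", "mydb.schema.tableA"),
            Dataset("snowflake", "mydb.schema.tableB"),
        ],
        outlets=[
            Dataset("snowflake", "mydb.schema.tableE"),
            Dataset("snowflake", "mydb.schema.tableF"),
            Dataset("snowflake", "mydb.schema.tableG"),  # extra outlets added for the test
        ],
    )
```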
b
Outputs should indeed be an array. I think if you're not seeing this it's a bug. Do you mind opening a private browsing window and trying again? It could be that the local cache for the downstream lineage is not being updated!
g
Hey @icy-holiday-55016 - airflow jobs should be able to have multiple outputs. The logic for adding those edges is here, and supports adding arrays of output datasets. I do notice in your screenshots that you are highlighting two different datasets - Table F is the most downstream in the left screenshot but Table G is the most downstream in the right one. Could that be the source of the confusion?
i
hi @big-carpet-38439, I tried opening a private browser window, though it didn't make a difference in this case. @green-football-43791 I'm not sure it's an issue of focus. I've added more info below (I also changed the dataset names to be slightly more meaningful).
Screenshot 1: the Airflow DAG with 2 inlets and 1 outlet.
Screenshot 2: after I execute the DAG, this is what I see in the lineage viewer (with focus on the task). This shows what I expect.
Screenshot 3: the lineage viewer with focus on the 'input1' dataset. This raises what may be a separate question: should there be two lines from input1 (one to the task and one to output1)? I would have thought there's no direct link between input1 and output1.
Screenshot 4: the same DAG as before, but modified to have an additional output.
Screenshot 5: the lineage viewer with focus on the Task after the DAG has been executed. Visually there is no change from before; I opened this in the private window.
Screenshot 6: the lineage viewer with focus on the 'input1' dataset. I can now see the new output 'output2'.
Screenshot 7: the lineage viewer with focus on the 'input2' dataset. Similar behavior to screenshot 6.
It seems that a relationship gets established directly between input1 and output2, with no reference to the Task in the middle. Apologies for the essay, and I appreciate the responses
I think I added too many screenshots! To amend the above:
Screenshot 6: the lineage viewer with focus on the 'input2' dataset. I can now see the new output 'output2'.
Screenshot 7: the lineage viewer with focus on the 'input1' dataset. I can now see the new output 'output2' (attached here).
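Purely to make the screenshot references easier to follow, the renamed inlets/outlets presumably look something like the following; the platform and exact dataset names are assumptions:

```python
from datahub_provider.entities import Dataset

# Hypothetical rename used in the screenshots: two inlets, and an
# 'output2' outlet added alongside the original 'output1'.
inlets = [
    Dataset("snowflake", "mydb.schema.input1"),
    Dataset("snowflake", "mydb.schema.input2"),
]
outlets = [
    Dataset("snowflake", "mydb.schema.output1"),
    Dataset("snowflake", "mydb.schema.output2"),  # the newly added outlet
]
```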
g
I see @icy-holiday-55016 - thanks for these additional details. Looking more into this.
i
You're welcome. I'm adding some logging to that class you pointed out to me; I'll paste the output before I finish up for the day.
g
thank you!
one other thing to try- I recently made an update to that file that it seems you may not have picked up
it removed the dataset<>dataset connections in the index builders
i
Good shout, I'll compare. I'm using the version from the 0.7.1 release.
g
sounds good- it's possible that pulling the latest may help (you will likely also want to get the latest Airflow backend- some job/dataset lineage changes have been made there recently as well)
I will also continue to investigate
i
I think that solved it, looks good now (see screenshot 1, it's focused on the task). Thanks for the pointer. I'm curious about the behavior in screenshot 2 though: it's focused on input1, and it shows a link to the Task and to both of the output datasets. I wouldn't expect a direct link between the input and the output datasets. What do you think?
g
Yes- that link should not be there anymore
if you use the newest airflow backend
and the newest Datahub graph builders, you should not expect to see those edges.
if you updated both, it's possible that those edges are left over from your last ingestion
but they should not be added any longer
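A conceptual sketch of the difference being described here (not DataHub's actual index-builder code): the updated builders keep only dataset -> task -> dataset edges, while the older ones also added direct dataset-to-dataset edges, which is where the input1 -> output2 line in the screenshots came from.

```python
# Conceptual sketch only -- not DataHub's actual graph-builder logic.
inlets = ["input1", "input2"]
outlets = ["output1", "output2"]
task = "run_data_task"

# Newer behavior: lineage flows through the task node only.
edges = [(i, task) for i in inlets] + [(task, o) for o in outlets]

# Older behavior additionally created direct dataset<>dataset edges,
# e.g. ("input1", "output2"), which showed up as the extra line.
legacy_edges = edges + [(i, o) for i in inlets for o in outlets]

print(edges)
print(legacy_edges)
```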
i
I don't think it's from the last ingestion, as I did a full restart of the stack (which blows away data in the DB). I'll try out the latest Airflow backend tomorrow morning, as well as the latest Datahub. Off topic a bit: would you happen to know when the next release is getting tagged? We're still quite early in our evaluation of Datahub; for now we're planning to pull in the latest from GitHub when there's a release.
g
We do a release every month- the next one would be at the end of May. However, you shouldn't feel the need to wait until a release to pull if there are features you want. The releases are usually done after significant milestones.
i
Yes, you're right, we'll work it out internally. Signing off for the day as it's getting late. Really appreciate your help.
g
certainly! enjoy your evening.