# ui
m
Good afternoon everyone, so I added a column to a Hive table via Spark (using beeline inside my Spark container and connecting to my Hive instance). Everything executed successfully and the column was added. The thing is that the corresponding task shown in DataHub seems to be a bit bugged. The initial look of the lineage is a bit strange (4 tasks appear, 2 before and 2 after the HDFS datasets), but the problem comes when I try to move any element shown (second image).
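For context, the change that triggered this was an ordinary Hive DDL. A minimal sketch, with hypothetical database, table, and column names (the JDBC URL is also an assumption):

```python
# Hypothetical names; a sketch of the DDL that was submitted via beeline.
ddl = "ALTER TABLE my_db.my_table ADD COLUMNS (new_col STRING)"

# From inside the Spark container, this would be run along the lines of:
# beeline -u "jdbc:hive2://hive-server:10000/default" -e "<the DDL above>"
print(ddl)
```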
b
hey Pablo! gotcha yeah, we've seen this issue before and recently pushed a fix for at least one way that this could happen. what version of datahub are you using? it's possible there are multiple ways this situation could occur. I believe this happens when the data we fetch results in duplicate nodes in the graph, and our graph renderer starts bugging out when it sees these duplicates
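The suspected failure mode above can be sketched in a few lines: if the fetched lineage contains the same entity twice and the renderer doesn't dedupe by URN, extra nodes get drawn. The URNs below are made up for illustration:

```python
# Simulated lineage fetch: the same dataJob URN appears twice (the bug).
fetched = [
    {"urn": "urn:li:dataJob:(spark,task1)", "type": "dataJob"},
    {"urn": "urn:li:dataset:(hdfs,/data/table,PROD)", "type": "dataset"},
    {"urn": "urn:li:dataJob:(spark,task1)", "type": "dataJob"},  # duplicate
]

def dedupe_by_urn(nodes):
    """Keep the first occurrence of each URN, preserving order."""
    seen, unique = set(), []
    for node in nodes:
        if node["urn"] not in seen:
            seen.add(node["urn"])
            unique.append(node)
    return unique

print(len(dedupe_by_urn(fetched)))  # 2 unique nodes instead of 3
```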
m
Hi @bulky-soccer-26729, I'm currently using v0.9.0. Your hypothesis fits this case, as the node that would have to be duplicated is the HDFS dataset. Thanks for the help!!
The "Details" section of the lineage is also bugged: the HDFS dataset doesn't appear, but the "bugged" Spark applications do.
b
of course! and can you explain a bit more about that next bug? in impact analysis for this task you're not seeing the hdfs datasets?
m
Yeah, so in the impact analysis, the HDFS dataset should appear among the 1st degree dependencies both upstream and downstream, but it doesn't. Another "problem" is that the "bugged" Spark applications DO appear as 2nd degree dependencies.
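Dependency degree in impact analysis is just hop count from the starting entity, so the behavior described above can be sketched with a small BFS over a toy graph (all names below are illustrative, not real DataHub entities):

```python
from collections import deque

# Toy downstream lineage: the task's 1st degree dependency should be the
# HDFS dataset, and anything past it lands at degree 2 and beyond.
downstream = {
    "spark_task": ["hdfs_out"],
    "hdfs_out": ["next_task"],
    "next_task": [],
}

def dependencies_by_degree(graph, start):
    """BFS from `start`, grouping reachable nodes by hop count (degree)."""
    degrees, frontier, seen = {}, deque([(start, 0)]), {start}
    while frontier:
        node, d = frontier.popleft()
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                degrees.setdefault(d + 1, []).append(nbr)
                frontier.append((nbr, d + 1))
    return degrees

print(dependencies_by_degree(downstream, "spark_task"))
# {1: ['hdfs_out'], 2: ['next_task']}
```

If the HDFS dataset is missing at degree 1 while other tasks still show at degree 2, the fetched graph must contain edges that skip over (or duplicate) the dataset node.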
b
is it the same for both upstream and downstream impact analysis here?
m
Yeah, in both cases the information shown is exactly the same
b
Trying to get up to speed here...
b
basically when we have duplicate data in the lineage viz graph - we get these wild arrow situations
Pablo, do you know if there's an issue here with your duplicate data?
if this is expected, we'll need another approach to solving this bug
m
Sorry, just read your answer. Yeah, basically both of the duplicated Spark tasks should not be there: when I click on their info and go to their lineage graph, they point to the task between the HDFS datasets (which is the real Spark task)
The correct lineage should be like this
A circular graph (HDFS dataset -> Spark task -> initial HDFS dataset) would work too, to avoid redundant info
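The two shapes suggested above can be written down as edge lists (node names are illustrative): a straight line through one task, or a cycle when the task updates the same dataset in place.

```python
# 1) Straight-line lineage: input dataset -> task -> output dataset.
linear = [("hdfs_input", "spark_task"), ("spark_task", "hdfs_output")]

# 2) Circular lineage: the task reads and writes the same dataset.
circular = [("hdfs_dataset", "spark_task"), ("spark_task", "hdfs_dataset")]

def node_count(edges):
    """Number of distinct nodes the renderer should draw."""
    return len({n for edge in edges for n in edge})

print(node_count(linear), node_count(circular))  # 3 2
```

Either way there is exactly one Spark task node, so the duplicated tasks (and the wild arrows) disappear once the graph is deduped.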