# ui
m
Good afternoon everyone, so I added a column to a Hive table via Spark (using beeline inside my Spark container and connecting to my Hive instance). Everything executed successfully and the column was added. The thing is that the corresponding task shown in DataHub seems to be a bit bugged. The initial look of the lineage is a bit strange (4 tasks appear, 2 before and 2 after the HDFS datasets), but the problem comes when I try to move any element shown (second image).
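For context, the change that triggered this was an ordinary Hive DDL. A minimal sketch, with hypothetical database, table, and column names (the JDBC URL is also an assumption):

```python
# Hypothetical names; a sketch of the DDL that was submitted via beeline.
ddl = "ALTER TABLE my_db.my_table ADD COLUMNS (new_col STRING)"

# From inside the Spark container, this would be run along the lines of:
# beeline -u "jdbc:hive2://hive-server:10000/default" -e "<the DDL above>"
print(ddl)
```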
b
hey Pablo! gotcha yeah, we've seen this issue before and recently pushed a fix for at least one way that this could happen. what version of datahub are you using? it's possible there are multiple ways this situation could occur. I believe this happens when the data we fetch results in duplicate nodes in the graph, and our graph renderer starts bugging out when it sees these duplicates
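The suspected failure mode above can be sketched in a few lines: if the fetched lineage contains the same entity twice and the renderer doesn't dedupe by URN, extra nodes get drawn. The URNs below are made up for illustration:

```python
# Simulated lineage fetch: the same dataJob URN appears twice (the bug).
fetched = [
    {"urn": "urn:li:dataJob:(spark,task1)", "type": "dataJob"},
    {"urn": "urn:li:dataset:(hdfs,/data/table,PROD)", "type": "dataset"},
    {"urn": "urn:li:dataJob:(spark,task1)", "type": "dataJob"},  # duplicate
]

def dedupe_by_urn(nodes):
    """Keep the first occurrence of each URN, preserving order."""
    seen, unique = set(), []
    for node in nodes:
        if node["urn"] not in seen:
            seen.add(node["urn"])
            unique.append(node)
    return unique

print(len(dedupe_by_urn(fetched)))  # 2 unique nodes instead of 3
```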
m
Hi @bulky-soccer-26729, I'm currently using v0.9.0. Your hypothesis fits this case, as the node that would have to be duplicated is the HDFS dataset. Thanks for the help!!
The "Details" section of the lineage is also bugged: the HDFS dataset doesn't appear, but the "bugged" Spark applications do.
b
of course! and can you explain a bit more about that next bug? in impact analysis for this task you're not seeing the hdfs datasets?
m
Yeah, so in the impact analysis, the HDFS dataset should appear among the 1st degree dependencies both upstream and downstream, but it doesn't. Another "problem" is that the "bugged" Spark applications DO appear as 2nd degree dependencies.
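Dependency degree in impact analysis is just hop count from the starting entity, so the behavior described above can be sketched with a small BFS over a toy graph (all names below are illustrative, not real DataHub entities):

```python
from collections import deque

# Toy downstream lineage: the task's 1st degree dependency should be the
# HDFS dataset, and anything past it lands at degree 2 and beyond.
downstream = {
    "spark_task": ["hdfs_out"],
    "hdfs_out": ["next_task"],
    "next_task": [],
}

def dependencies_by_degree(graph, start):
    """BFS from `start`, grouping reachable nodes by hop count (degree)."""
    degrees, frontier, seen = {}, deque([(start, 0)]), {start}
    while frontier:
        node, d = frontier.popleft()
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                degrees.setdefault(d + 1, []).append(nbr)
                frontier.append((nbr, d + 1))
    return degrees

print(dependencies_by_degree(downstream, "spark_task"))
# {1: ['hdfs_out'], 2: ['next_task']}
```

If the HDFS dataset is missing at degree 1 while other tasks still show at degree 2, the fetched graph must contain edges that skip over (or duplicate) the dataset node.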
b
is it the same for both upstream and downstream impact analysis here?
m
Yeah, in both cases the information shown is exactly the same
b
Trying to get up to speed here...
b
basically when we have duplicate data in the lineage viz graph - we get these wild arrow situations
Pablo, do you know if there's an issue here with your duplicate data?
if this is expected, we'll need another approach to solving this bug
m
Sorry, just read your answer. Yeah, basically both of the duplicated Spark tasks should not be there: when I click on their info and go to their lineage graph, they point to the task between the HDFS datasets (which is the real Spark task)
The correct lineage should be like this
A circular graph (HDFS dataset -> Spark task -> initial HDFS dataset) would work too, to avoid redundant info
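The two shapes suggested above can be written down as edge lists (node names are illustrative): a straight line through one task, or a cycle when the task updates the same dataset in place.

```python
# 1) Straight-line lineage: input dataset -> task -> output dataset.
linear = [("hdfs_input", "spark_task"), ("spark_task", "hdfs_output")]

# 2) Circular lineage: the task reads and writes the same dataset.
circular = [("hdfs_dataset", "spark_task"), ("spark_task", "hdfs_dataset")]

def node_count(edges):
    """Number of distinct nodes the renderer should draw."""
    return len({n for edge in edges for n in edge})

print(node_count(linear), node_count(circular))  # 3 2
```

Either way there is exactly one Spark task node, so the duplicated tasks (and the wild arrows) disappear once the graph is deduped.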