# integrate-tableau-datahub
s
0.8.32.4: Duplicate datasets that belong to one workbook appear with different GUIDs. One is treated as Embedded and the other as Published, with the same downstream links to the Chart. Is this expected behavior? I believe it should be a transitive dependency:
View -> Published DS -> Embedded DS -> Chart
Or one of them (possibly the embedded one) should be removed.
Tableau lineage looks like this
l
@hundreds-photographer-13496 ^
a
s
But in my case these two are also connected )
As far as I can imagine, the chart should be connected to the embedded datasource, the embedded one to the published one, and the published one to the view/table. Not like now, when the chart is connected to both embedded and published identically.
h
@shy-parrot-64120 you are right about this being the ideal lineage
View -> Published DS -> Embedded DS -> Chart
As discussed in the other thread mentioned by David, the Tableau Metadata API doesn't clearly distinguish between immediate upstream datasources and indirect (transitive) upstream datasources, so both direct and indirect datasources are shown in the DataHub upstream lineage for a chart. We can make certain assumptions and workarounds to fix this. There are already some suggestions in the other thread. I'll be doing further exploration and experiments to see how to tackle this problem. Will update here. Let me know if this makes sense.
s
OMG, seems like a bug in Tableau lineage. Sure, it makes sense: in any case the chart will point to the embedded datasource, which can point to a view or a published datasource.
h
Well, I doubt that it's a bug. It seems to be by design that upstreamDatasources returns all the upstreams (direct + indirect), and vice versa for downstream.
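To illustrate the point above, here is a minimal sketch (not DataHub or Tableau code) of why a field that returns direct plus indirect upstreams produces duplicate chart edges. The node names and the `DIRECT_UPSTREAMS` mapping are hypothetical and simply mirror the ideal chain from this thread.

```python
# Hypothetical direct lineage, mirroring the ideal chain from this thread.
DIRECT_UPSTREAMS = {
    "Chart": ["Embedded DS"],
    "Embedded DS": ["Published DS"],
    "Published DS": ["View"],
    "View": [],
}


def all_upstream_datasources(node):
    """Mimic a field that returns direct AND indirect upstream datasources."""
    found = []
    stack = list(DIRECT_UPSTREAMS[node])
    while stack:
        current = stack.pop()
        # Only datasources count here; the view is a table-like node.
        if current.endswith("DS") and current not in found:
            found.append(current)
        stack.extend(DIRECT_UPSTREAMS[current])
    return found


# Emitting one edge per returned datasource links the chart to both
# datasources, although only the embedded one is immediately upstream.
chart_edges = [(ds, "Chart") for ds in all_upstream_datasources("Chart")]
```

With this input the chart ends up with an edge from both the embedded and the published datasource, which is the redundancy reported at the top of the thread.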
So here is what I have done to fix this problem:
Caveats in current tableau source -
• upstreamDatasources from the Sheet object returns direct as well as indirect upstreams (and vice versa for downstreamSheets from a Datasource). Using these leads to redundant lineage edges, causing a complex (less accurate) lineage graph.
• upstreamTables for an embedded datasource returns tables that are indirectly upstream of it via an upstream published datasource.
Proposed Solution -
• Use the Sheet object's datasourceFields to find the datasources immediately upstream of the sheet. It is observed and confirmed from the GraphQL docs that datasourceFields are always from an embedded datasource.
• For an embedded datasource,
◦ if there are upstream published datasources, emit embedded datasource → upstream published datasource lineage.
◦ Else emit embedded datasource → upstream tables lineage.
• For a published datasource
◦ always emit published datasource → upstream tables lineage.
I have created this tableau PR for fixing this and some other problems - https://github.com/datahub-project/datahub/pull/4724
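The proposed rules can be sketched roughly as below. This is only an illustration, not the code from the PR: the field names `datasourceFields`, `upstreamDatasources` and `upstreamTables` come from the message above, while the dictionary shapes, the `id` keys, the helper names and the sample data are all assumptions made for the example.

```python
def sheet_upstreams(sheet):
    """Datasources immediately upstream of a sheet, read from datasourceFields."""
    ids = []
    for field in sheet.get("datasourceFields", []):
        ds_id = field["datasource"]["id"]
        if ds_id not in ids:  # many fields usually share one datasource
            ids.append(ds_id)
    return ids


def datasource_upstreams(ds, is_embedded):
    """Apply the embedded/published rules from the proposal."""
    published = [d["id"] for d in ds.get("upstreamDatasources", [])]
    if is_embedded and published:
        # Embedded datasource backed by published ones: skip the tables,
        # they are only indirectly upstream.
        return published
    # Published datasource, or embedded datasource without published upstreams.
    return [t["id"] for t in ds.get("upstreamTables", [])]


# Hypothetical sample objects (shapes assumed for this sketch).
sheet = {
    "id": "chart",
    "datasourceFields": [
        {"datasource": {"id": "embedded_ds"}},
        {"datasource": {"id": "embedded_ds"}},
    ],
}
embedded = {
    "id": "embedded_ds",
    "upstreamDatasources": [{"id": "published_ds"}],
    "upstreamTables": [{"id": "view"}],  # indirect, via published_ds
}
published = {"id": "published_ds", "upstreamTables": [{"id": "view"}]}

edges = [(up, sheet["id"]) for up in sheet_upstreams(sheet)]
edges += [(up, embedded["id"]) for up in datasource_upstreams(embedded, True)]
edges += [(up, published["id"]) for up in datasource_upstreams(published, False)]
```

For the sample data the resulting edges form the single chain view → published_ds → embedded_ds → chart, with no direct published_ds → chart edge.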
b
Hi! It is perfect. Do you happen to have an estimate of which version this tweak will come in?
h
This PR was recently merged and should come out as part of the next DataHub release. Will update once the release is out.
The lineage tweak is available in DataHub version v0.8.34!! Do try it out.