# integrate-tableau-datahub
Hey guys! While load testing the adjustments made in Tableau, I noticed something a little strange. It looks like the ingester has duplicated several datasources. In the image, the charts in question have only one datasource, but two appear. Another thing worth mentioning is that the datasource connected through ODBC doesn't have the correct field descriptions.
Very strange, indeed! Thanks so much for sharing - @hundreds-photographer-13496 please take a look when you have a chance!
Hey @brainy-wall-41694, can you confirm whether these two datasources have different URNs? Also, what type of datasource is each of them (Published / Embedded)? It's listed in the properties as type.
1 - Can you confirm whether these two datasources have different URNs? Yes. 2 - What type of datasource is each of them? The workbook uses a published data source. The data source that is correct, with the field descriptions and data types, is a Published Data source. The incorrect one has type: EmbeddedDatasource
If I can help with anything else just let me know.
Workbook only with embedded data source has no problems.
Yes, it seems that when a Tableau workbook connects to a published datasource, it also creates an embedded datasource inside the workbook that has the published datasource as its upstream. Something like: Original tables (e.g. Genericodbc) <- Published Datasource <- Embedded Datasource <- Sheet. This is not captured correctly at the moment.
Can you verify whether you see an embedded datasource in your workbook that connects to a published datasource with the same name? It seems it is in fact possible to rename the embedded datasource.
The Tableau Metadata API gives relationships like these:
1. Original tables (e.g. Genericodbc) <- Published Datasource
2. Original tables (e.g. Genericodbc) <- Embedded Datasource
3. Embedded Datasource <- Sheet
4. Published Datasource <- Sheet
5. Published Datasource <- Embedded Datasource
PS: A<-B means A is upstream of B
We can either:
A. capture all of these (to reflect the Tableau Metadata API), or
B. capture a simpler representation (to reflect the Tableau UI) by omitting relations 2 and 4 when an embedded datasource points to a published datasource.
What do you think?
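To make option B concrete, here is a minimal sketch of the filtering step, assuming lineage is held as plain (upstream, downstream) edges plus a lookup of node kinds. The `Edge` type, the `kinds` mapping and the `simplify_lineage` name are hypothetical and only for illustration. This is not how the DataHub Tableau source is actually implemented.

```python
from typing import Dict, List, NamedTuple, Set


class Edge(NamedTuple):
    """Hypothetical lineage edge: `upstream` feeds `downstream` (A <- B)."""
    upstream: str
    downstream: str


def simplify_lineage(edges: List[Edge], kinds: Dict[str, str]) -> List[Edge]:
    """Option B sketch: drop relations 2 and 4 when an embedded datasource
    points to a published datasource.

    `kinds` maps a node name to "table", "published", "embedded" or "sheet".
    """
    # Relation 5: embedded datasources that wrap a published datasource.
    wraps: Dict[str, Set[str]] = {}
    for e in edges:
        if kinds[e.upstream] == "published" and kinds[e.downstream] == "embedded":
            wraps.setdefault(e.downstream, set()).add(e.upstream)

    # Sheets that already reach a published datasource through such a wrapper.
    covered: Dict[str, Set[str]] = {}
    for e in edges:
        if e.upstream in wraps and kinds[e.downstream] == "sheet":
            covered.setdefault(e.downstream, set()).update(wraps[e.upstream])

    kept = []
    for e in edges:
        up, down = kinds[e.upstream], kinds[e.downstream]
        if up == "table" and down == "embedded" and e.downstream in wraps:
            continue  # relation 2: table <- wrapping embedded datasource
        if up == "published" and down == "sheet" and e.upstream in covered.get(e.downstream, set()):
            continue  # relation 4: published <- sheet, already implied via the wrapper
        kept.append(e)
    return kept
```

With the five relations listed above as input, only 1, 3 and 5 survive. A workbook that has just an embedded datasource (no published upstream) keeps all of its edges.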
I think option B is better, for the following reasons:
- Option A will cause published Tableau data sources to be duplicated, which doesn't reflect reality and causes confusion for both technical and business users.
- This duplication would mean that any documentation would have to be written in each of the data sources recognized as "embedded", whereas in the published data source it would only be done once.
- The lineage view of the data/graphs/dashboard would be much cleaner.
It would be great if everyone who uses Tableau shared their opinions too.
Great catch, @brainy-wall-41694! I agree that B would be simpler, but I'm a bit concerned that we'd lose some information here. (Please note that at this point this is speculation, I have not verified what the metadata API is actually doing.) In particular, Tableau allows for calculated fields (and groups, sets, etc. that are basically glorified calculated fields) to be either part of the published data source or part of the downstream workbook. If this "embedded" data source is Tableau's way of representing calculated fields that are specific to the workbook, we wouldn't want to remove that information from DataHub (it could be incredibly important information for the end users), and representing those fields as being part of the published data source would also be incorrect. If I'm correct and that's what Tableau is doing, I think option A may be a better route. An option C - and I realize this would be a lot more work - may be to basically only populate the embedded data source with fields that are specific to the workbook/embedded data source. That is, it would strip out any fields from the embedded datasource that come directly from the published datasource.
It looks like that's it. But maybe there is something that can help in implementing option C. It seems to me, when the embedded data source "copies" the published data source fields, the "DATASOURCEFIELD" tag is inserted. If the field had only this tag, it could be disregarded for being duplicated. This would cause the calculated fields to be brought into the Datahub without causing duplication. A second possible validation would be not to create this embedded data source when there are only fields with the tag "DATASOURCEFIELD".
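A rough sketch of the two checks described above, assuming each field of the embedded datasource is available as a dict with a `tags` list. The field shape and both function names are hypothetical. The only thing taken from the discussion is the "DATASOURCEFIELD" marker.

```python
from typing import Dict, List

# Marker that, per the message above, appears on fields copied from the published datasource.
COPIED_TAG = "DATASOURCEFIELD"


def workbook_specific_fields(fields: List[Dict]) -> List[Dict]:
    """Keep only fields that are not plain copies of published datasource fields.

    A field is treated as a copy when "DATASOURCEFIELD" is its only tag.
    """
    return [f for f in fields if set(f["tags"]) != {COPIED_TAG}]


def should_emit_embedded_datasource(fields: List[Dict]) -> bool:
    """Second validation: skip the embedded datasource entirely when every
    field is just a copy of a published datasource field."""
    return bool(workbook_specific_fields(fields))
```

For example, an embedded datasource whose fields all carry only the "DATASOURCEFIELD" tag would not be created, while one with an extra calculated field would be created with just that field.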
I've tested creating a datasource, attaching it to a workbook and retrieving the data using graphql. The only difference seems to be in the fields type: a
while a
. As per the graphQL docs:
Data source fields can only exist in embedded data sources which connect to a published data source. A data source field is an embedded data source's 'layered' representation of a field that already exists in the published data source and is mostly a copy of the field in the published data source. Data source fields can get their own descriptions and renames local to the embedded data source, but cannot otherwise be modified in the embedded data source.
I can confirm that an embedded data source field can be renamed and it won't be reflected on the upstream data source.
I understand that it might be confusing, as end users might think that there is a single datasource. But the truth is that it is indeed a different data source that might have different metadata (names or descriptions)
Maybe we can make this configurable at source with a flag and let end users decide which behaviour they desire. Should be also pretty straightforward to implement. What do you think?
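As a sketch of what such a flag could look like: the config class, the flag name `collapse_embedded_datasources` and the datasource dict shape below are all made up for illustration and are not existing DataHub options. The default keeps both datasources, mirroring what the metadata API returns.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TableauLineageConfig:
    """Hypothetical source config; the flag name is an assumption."""
    # False: emit published and embedded datasources (metadata API view).
    # True: hide embedded datasources that only wrap a published one (UI view).
    collapse_embedded_datasources: bool = False


def datasources_to_emit(
    config: TableauLineageConfig,
    published: List[Dict],
    embedded: List[Dict],
) -> List[str]:
    """Return the names of the datasources that would be ingested.

    Each embedded datasource dict is assumed to carry an
    `upstream_published` list naming the published datasources it wraps.
    """
    names = [ds["name"] for ds in published]
    for ds in embedded:
        wraps_published = bool(ds.get("upstream_published"))
        if config.collapse_embedded_datasources and wraps_published:
            continue  # user opted for the simpler, UI-like representation
        names.append(ds["name"])
    return names
```

Embedded datasources with no published upstream are always emitted, so workbooks that only use embedded data sources behave the same under either setting.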
I also tried adding a calculated field @early-article-88153 and the behaviour is the same as for the other fields. Seems to be unrelated
btw TIL that there is an embedded graphiql in tableau server, so if you want to check yourselves just replace here your URI and datasource name