# integrate-tableau-datahub
Hi folks! I've been playing with the Tableau source for DataHub for a couple of days now and have some feedback: 🧵
First and foremost, this is awesome. Thank you.
Some context: we're using this against a production Tableau environment that pulls information from a complex data warehouse. The whole workflow looks like: Oracle-based transactional database -> data transformation tool -> Oracle-based data warehouse -> Tableau published data source -> Tableau worksheets/dashboards.
The first error I ran into was this:
```
Source (tableau) report:
{'workunits_produced': 0,
 'workunit_ids': [],
 'warnings': {},
 'failures': {'tableau-metadata': ["Unable to retrieve metadata from tableau. Information: Connection: workbooksConnection Error: [{'message': "
                                   "'Showing partial results. The request exceeded the 20000 node limit. Use pagination, additional filtering, or "
                                   "both in the query to adjust results.', 'extensions': {'severity': 'WARNING', 'code': 'NODE_LIMIT_EXCEEDED', "
                                   '\'properties\': {\'nodeLimit\': 20000}}}, {\'message\': "Cannot return null for non-nullable type: '
                                   "'RemoteType' within parent 'Column' "
                                   '(/workbooksConnection/nodes[8]/embeddedDatasources[0]/upstreamTables[0]/columns[30]/remoteType)", \'path\': '
                                   "['workbooksConnection', 'nodes', 8, 'embeddedDatasources', 0, 'upstreamTables', 0, 'columns', 30, 'remoteType'], "
                                   '\'locations\': None, \'errorType\': \'DataFetchingException\', \'extensions\': None}, {\'message\': "Cannot '
                                   "return null for non-nullable type: 'RemoteType' within parent 'Column' "
                                   '(/workbooksConnection/nodes[8]/embeddedDatasources[0]/upstreamTables[0]/columns[31]/remoteType)", \'path\': '
                                   "['workbooksConnection', 'nodes', 8, 'embeddedDatasources', 0, 'upstreamTables', 0, 'columns', 31, 'remoteType'], "
                                   '\'locations\': None, \'errorType\': \'DataFetchingException\', \'extensions\': None}, {\'message\': "Cannot '
                                   "return null for non-nullable type: 'RemoteType' within parent 'Column' "...
```
…and it repeats that last error message (about `RemoteType`) ad nauseam.
From what I can tell, the `RemoteType` error is spurious, and what was actually happening here was that requesting information on 10 workbooks at a time was causing Tableau's API to return more data than was allowed. I had to move to requesting 1 workbook at a time, which resolved the errors. In particular, on line 807 of `metadata-ingestion/src/datahub/ingestion/source/tableau.py` I changed:
```python
yield from self.emit_workbooks(10)
```
to:
```python
yield from self.emit_workbooks(1)
```
This fixed the errors I was seeing and allowed ingestion to proceed.
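For what it's worth, here's a rough sketch of what a configurable page size might look like; the `workbooks_page_size` field name and its placement are my invention, not an existing option:
```python
# Hypothetical sketch only: workbooks_page_size does not exist in the current
# Tableau source config; the default of 10 matches today's hard-coded value.
from pydantic import Field

from datahub.configuration.common import ConfigModel


class TableauConfig(ConfigModel):
    connect_uri: str
    # ... existing fields elided ...
    workbooks_page_size: int = Field(
        default=10,
        description="Number of workbooks to request per metadata API call; "
        "lower this if you hit NODE_LIMIT_EXCEEDED errors.",
    )


# and then line 807 of tableau.py would become:
#     yield from self.emit_workbooks(self.config.workbooks_page_size)
```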
I did still see one case of this error:
```
Source (tableau) report:
{'workunits_produced': 0,
 'workunit_ids': [],
 'warnings': {},
 'failures': {'tableau-metadata': ["Unable to retrieve metadata from tableau. Information: Connection: workbooksConnection Error: [{'message': "
                                   "'Showing partial results. The request exceeded the 20000 node limit. Use pagination, additional filtering, or "
                                   "both in the query to adjust results.', 'extensions': {'severity': 'WARNING', 'code': 'NODE_LIMIT_EXCEEDED', "
                                   '\'properties\': {\'nodeLimit\': 20000}}}]]}
}
```
even when requesting a single workbook at a time. As a warning this wasn't terribly problematic, but it means I may be missing information for that particular workbook. It might make sense to split the workbooks query into smaller queries, if possible.
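To illustrate what I mean by splitting it up - a rough sketch of paging through `workbooksConnection` one workbook at a time, where `run_metadata_query` is a stand-in for however the source actually executes its GraphQL (so the names and fields here are assumptions):
```python
# Sketch: page through workbooksConnection in small chunks so one oversized
# workbook only affects its own page instead of the whole request.
# run_metadata_query is a hypothetical helper, not the source's real API.
WORKBOOKS_PAGE_QUERY = """
query workbooks($first: Int, $offset: Int) {
  workbooksConnection(first: $first, offset: $offset) {
    totalCount
    nodes { id name projectName }
  }
}
"""


def fetch_workbooks_paginated(run_metadata_query, page_size=1):
    offset = 0
    while True:
        data = run_metadata_query(
            WORKBOOKS_PAGE_QUERY, variables={"first": page_size, "offset": offset}
        )
        connection = data["workbooksConnection"]
        yield from connection["nodes"]
        offset += page_size
        if offset >= connection["totalCount"]:
            break
```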
The next issue is the way the Tableau source constructs URNs compared to other ingestion sources. (Keep in mind my frame of reference here is Oracle - there are gobs of data sources I've not tried.) When pulling in information from Oracle I'm getting datasets with URNs that look like:
```
urn:li:dataset:(urn:li:dataPlatform:oracle,schema.TableName,PROD)
```
When Tableau constructs a URN from the data source pointing at the same table, it builds it as:
```
urn:li:dataset:(urn:li:dataPlatform:oracle,hostname:1521.SCHEMA.TableName,PROD)
```
The two particular issues here are the inclusion of the hostname (`upstream_db` in the Tableau files) and the schema being upper-case. (Because everything is upper-case in Oracle by default, I'm assuming the lower-case schema name here comes from SQLAlchemy, but that's just an assumption.) In order to get Tableau's URNs to match what I was getting from Oracle, I had to adjust `metadata-ingestion/src/datahub/ingestion/source/tableau_common.py` around line 400:
```python
database_name = f"{upstream_db}." if upstream_db else ""
schema_name = f"{schema}." if schema else ""

urn = builder.make_dataset_urn(
    platform, f"{database_name}{schema_name}{final_name}", env
)
```
to:
```python
database_name = f"{upstream_db}." if upstream_db else ""
schema_name = f"{schema.lower()}." if schema else ""

urn = builder.make_dataset_urn(
    platform, f"{schema_name}{final_name}", env
)
```
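Since `make_dataset_urn` is just string assembly, the effect of the change is easy to sanity-check in isolation (using the example table from above):
```python
# Quick before/after check of the URNs, using DataHub's standard builder
# and the example Oracle table from this thread.
import datahub.emitter.mce_builder as builder

# Before: upstream_db (hostname) included and schema upper-cased
print(builder.make_dataset_urn("oracle", "hostname:1521.SCHEMA.TableName", "PROD"))
# urn:li:dataset:(urn:li:dataPlatform:oracle,hostname:1521.SCHEMA.TableName,PROD)

# After: hostname dropped and schema lower-cased, matching the Oracle source
print(builder.make_dataset_urn("oracle", "schema.TableName", "PROD"))
# urn:li:dataset:(urn:li:dataPlatform:oracle,schema.TableName,PROD)
```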
Finally, the ordering of ingestion from different data sources matters. If I ingest from Oracle first and then ingest from Tableau, Tableau will re-order columns and I lose my indicators of primary and foreign keys. In the long run I think this is less of a concern, but it's probably worth documenting the ordering here.
If I could make a couple of suggestions:
• Configuration options for the number of workbooks to query at one time. 10 is probably a reasonable default for most setups, but for some more complex data sources it may create issues.
• Configuration options for URN construction. I don't know enough about DataHub (I'm still new here! Happy to investigate other pieces) to know which URN construction makes more sense, but there needs to be a way to make sure that datasets originating from different systems are matched up properly.
• If a dataset already exists in DataHub, perhaps a configuration option to update it?
If any additional information is needed, please do let me know, and I'm happy to do these as GitHub issues, pull requests, etc. - whatever the preference is.
Having played with this a bunch more today and dug into some stuff, here's some additional feedback.
First, the warning I noted above ("exceeded the 20000 node limit", https://datahubspace.slack.com/archives/C02GBGG90CU/p1644882062984109?thread_ts=1644881402.488669&cid=C02GBGG90CU) is being treated by DataHub as an error, and it stops processing workbooks when it hits it. In a project with 18 workbooks, it gets through 2, hits the error on the third, and only ingests those 2 into DataHub. So I'd highly suggest refactoring the GraphQL query into smaller pieces - I realize it's ugly and adds complexity, but I'm not sure there's any other realistic way around it.
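Short of a full refactor, another option might be to special-case that error code so partial results still get processed. A sketch, with the error shape taken from the report above (`report` stands in for the source's report object):
```python
# Sketch: treat NODE_LIMIT_EXCEEDED as the warning Tableau says it is, and
# only abort on genuinely fatal GraphQL errors, so the remaining workbooks
# in the project still get ingested.
def split_fatal_errors(errors, report):
    fatal = []
    for error in errors:
        extensions = error.get("extensions") or {}
        if extensions.get("code") == "NODE_LIMIT_EXCEEDED":
            # Partial results for this page; log it and keep going.
            report.report_warning("tableau-metadata", error["message"])
        else:
            fatal.append(error)
    return fatal
```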
Second, I think I've found the cause of the "IndexError: list index out of range" error reported earlier (https://datahubspace.slack.com/archives/C02GBGG90CU/p1644507323708139). I hit it today with a workbook that has both hidden worksheets (i.e. sheets in the workbook that are published but not exposed through the web interface) and viz-in-tooltip sheets (i.e. sheets in the workbook only accessible by hovering over marks in another sheet). In both of these cases the "path" element in the sheet structure is empty and the "containedInDashboards" list is empty. The code at line 647 of `metadata-ingestion/src/datahub/ingestion/source/tableau.py` assumes that AT LEAST one of these elements will have a value, and doesn't deal with the scenario where both are blank:
```python
if sheet.get("path", ""):
    sheet_external_url = f"{self.config.connect_uri}#/site/{self.config.site}/views/{sheet.get('path', '')}"
else:
    # sheet contained in dashboard
    dashboard_path = sheet.get("containedInDashboards")[0].get("path", "")
    sheet_external_url = f"{self.config.connect_uri}/t/{self.config.site}/authoring/{dashboard_path}/{sheet.get('name', '')}"
```
In this case, the else should be replaced with a check that the length of `sheet.get("containedInDashboards")` is greater than 0, and a new else created that deals with the situation where both are blank.
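Something along these lines, sketched against the snippet above (the both-blank fallback of `None` is my guess at reasonable behavior; skipping the field entirely might be better):
```python
# Sketch of the fix: handle hidden and viz-in-tooltip sheets, where both
# "path" and "containedInDashboards" are empty.
dashboards = sheet.get("containedInDashboards") or []
if sheet.get("path", ""):
    sheet_external_url = (
        f"{self.config.connect_uri}#/site/{self.config.site}"
        f"/views/{sheet.get('path', '')}"
    )
elif len(dashboards) > 0:
    # sheet contained in dashboard
    dashboard_path = dashboards[0].get("path", "")
    sheet_external_url = (
        f"{self.config.connect_uri}/t/{self.config.site}"
        f"/authoring/{dashboard_path}/{sheet.get('name', '')}"
    )
else:
    # hidden or viz-in-tooltip sheet: no usable path, leave the URL unset
    sheet_external_url = None
```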
Thank you for this amazing feedback, @early-article-88153!! I really appreciate the time & detail you’ve put into this 🙂