# integrate-tableau-datahub
Hi folks! I've been playing with the Tableau source for DataHub for a couple of days now and have some feedback: 🧵
First and foremost, this is awesome. Thank you.
Some context: we're using this against a production Tableau environment that pulls information from a complex data warehouse. The whole workflow looks like: Oracle-based transactional database -> data transformation tool -> Oracle-based data warehouse -> Tableau published data source -> Tableau worksheets/dashboards.
The first error I ran into was this:
```
Source (tableau) report:
{'workunits_produced': 0,
 'workunit_ids': [],
 'warnings': {},
 'failures': {'tableau-metadata': ["Unable to retrieve metadata from tableau. Information: Connection: workbooksConnection Error: [{'message': "
                                   "'Showing partial results. The request exceeded the 20000 node limit. Use pagination, additional filtering, or "
                                   "both in the query to adjust results.', 'extensions': {'severity': 'WARNING', 'code': 'NODE_LIMIT_EXCEEDED', "
                                   '\'properties\': {\'nodeLimit\': 20000}}}, {\'message\': "Cannot return null for non-nullable type: '
                                   "'RemoteType' within parent 'Column' "
                                   '(/workbooksConnection/nodes[8]/embeddedDatasources[0]/upstreamTables[0]/columns[30]/remoteType)", \'path\': '
                                   "['workbooksConnection', 'nodes', 8, 'embeddedDatasources', 0, 'upstreamTables', 0, 'columns', 30, 'remoteType'], "
                                   '\'locations\': None, \'errorType\': \'DataFetchingException\', \'extensions\': None}, {\'message\': "Cannot '
                                   "return null for non-nullable type: 'RemoteType' within parent 'Column' "
                                   '(/workbooksConnection/nodes[8]/embeddedDatasources[0]/upstreamTables[0]/columns[31]/remoteType)", \'path\': '
                                   "['workbooksConnection', 'nodes', 8, 'embeddedDatasources', 0, 'upstreamTables', 0, 'columns', 31, 'remoteType'], "
                                   '\'locations\': None, \'errorType\': \'DataFetchingException\', \'extensions\': None}, {\'message\': "Cannot '
                                   "return null for non-nullable type: 'RemoteType' within parent 'Column' "...
```
…and it repeats that last error message (about `RemoteType`) ad nauseam.
From what I can tell, the `RemoteType` error is spurious, and what was actually happening here was that requesting information on 10 workbooks at a time was causing Tableau's API to return more data than was allowed. I had to move to requesting 1 workbook at a time, which resolved the errors. In particular, on line 807 of `metadata-ingestion/src/datahub/ingestion/source/tableau.py` I changed:
```python
yield from self.emit_workbooks(10)
```
to:
```python
yield from self.emit_workbooks(1)
```
This fixed the errors I was seeing and allowed ingestion to proceed.
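For what it's worth, here's a rough sketch of what a configurable page size might look like; the `workbooks_page_size` field name and its placement are my invention, not an existing option:
```python
# Hypothetical sketch only: workbooks_page_size does not exist in the current
# Tableau source config; the default of 10 matches today's hard-coded value.
from pydantic import Field

from datahub.configuration.common import ConfigModel


class TableauConfig(ConfigModel):
    connect_uri: str
    # ... existing fields elided ...
    workbooks_page_size: int = Field(
        default=10,
        description="Number of workbooks to request per metadata API call; "
        "lower this if you hit NODE_LIMIT_EXCEEDED errors.",
    )


# and then line 807 of tableau.py would become:
#     yield from self.emit_workbooks(self.config.workbooks_page_size)
```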
I did still see one case of this error:
```
Source (tableau) report:
{'workunits_produced': 0,
 'workunit_ids': [],
 'warnings': {},
 'failures': {'tableau-metadata': ["Unable to retrieve metadata from tableau. Information: Connection: workbooksConnection Error: [{'message': "
                                   "'Showing partial results. The request exceeded the 20000 node limit. Use pagination, additional filtering, or "
                                   "both in the query to adjust results.', 'extensions': {'severity': 'WARNING', 'code': 'NODE_LIMIT_EXCEEDED', "
                                   '\'properties\': {\'nodeLimit\': 20000}}}]]}
}
```
even when requesting a single workbook at a time. As a warning this wasn't terribly problematic, but it means I may be missing information for that particular workbook. It might make sense to split the workbooks query into smaller queries, if possible.
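To illustrate what I mean by splitting it up - a rough sketch of paging through `workbooksConnection` one workbook at a time, where `run_metadata_query` is a stand-in for however the source actually executes its GraphQL (so the names and fields here are assumptions):
```python
# Sketch: page through workbooksConnection in small chunks so one oversized
# workbook only affects its own page instead of the whole request.
# run_metadata_query is a hypothetical helper, not the source's real API.
WORKBOOKS_PAGE_QUERY = """
query workbooks($first: Int, $offset: Int) {
  workbooksConnection(first: $first, offset: $offset) {
    totalCount
    nodes { id name projectName }
  }
}
"""


def fetch_workbooks_paginated(run_metadata_query, page_size=1):
    offset = 0
    while True:
        data = run_metadata_query(
            WORKBOOKS_PAGE_QUERY, variables={"first": page_size, "offset": offset}
        )
        connection = data["workbooksConnection"]
        yield from connection["nodes"]
        offset += page_size
        if offset >= connection["totalCount"]:
            break
```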
The next issue is the way the Tableau source constructs URNs compared to other ingestion sources. (Keep in mind my frame of reference here is Oracle - there are gobs of data sources I've not tried.) When pulling in information from Oracle I'm getting datasets with URNs that look like:
```
urn:li:dataset:(urn:li:dataPlatform:oracle,schema.TableName,PROD)
```
When Tableau constructs a URN from the data source pointing at the same table, it builds it as:
```
urn:li:dataset:(urn:li:dataPlatform:oracle,hostname:1521.SCHEMA.TableName,PROD)
```
The two particular issues here are the inclusion of the hostname (`upstream_db` in the Tableau files) and the schema being upper-case. (Because everything is upper-case in Oracle by default, I'm assuming the lower-case schema name here comes from SQLAlchemy, but that's just an assumption.) In order to get Tableau's URNs to match what I was getting from Oracle, I had to adjust `metadata-ingestion/src/datahub/ingestion/source/tableau_common.py` around line 400:
```python
database_name = f"{upstream_db}." if upstream_db else ""
schema_name = f"{schema}." if schema else ""

urn = builder.make_dataset_urn(
    platform, f"{database_name}{schema_name}{final_name}", env
)
```
to:
```python
database_name = f"{upstream_db}." if upstream_db else ""
schema_name = f"{schema.lower()}." if schema else ""

urn = builder.make_dataset_urn(
    platform, f"{schema_name}{final_name}", env
)
```
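Since `make_dataset_urn` is just string assembly, the effect of the change is easy to sanity-check in isolation (using the example table from above):
```python
# Quick before/after check of the URNs, using DataHub's standard builder
# and the example Oracle table from this thread.
import datahub.emitter.mce_builder as builder

# Before: upstream_db (hostname) included and schema upper-cased
print(builder.make_dataset_urn("oracle", "hostname:1521.SCHEMA.TableName", "PROD"))
# urn:li:dataset:(urn:li:dataPlatform:oracle,hostname:1521.SCHEMA.TableName,PROD)

# After: hostname dropped and schema lower-cased, matching the Oracle source
print(builder.make_dataset_urn("oracle", "schema.TableName", "PROD"))
# urn:li:dataset:(urn:li:dataPlatform:oracle,schema.TableName,PROD)
```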
Finally, the ordering of ingestion from different data sources matters. If I ingest from Oracle first and then ingest from Tableau, Tableau will re-order columns and I lose my indicators of primary and foreign keys. In the long run I think this is less of a concern, but it's probably worth documenting the ordering here.
If I could make a couple of suggestions:
• Configuration options for the number of workbooks to query at one time. 10 is probably a reasonable default for most setups, but for some more complex data sources it may create issues.
• Configuration options for URN construction. I don't know enough about DataHub (I'm still new here! Happy to investigate other pieces) to know which URN construction makes more sense, but there needs to be a way to make sure that datasets originating from different systems are matched up properly.
• If a dataset already exists in DataHub, perhaps a configuration option to update it?
If any additional information is needed, please do let me know, and I'm happy to do these as GitHub issues, pull requests, etc. - whatever the preference is.
Having played with this a bunch more today and dug into some stuff, here's some additional feedback.
First, the warning I noted above ("exceeded the 20000 node limit", https://datahubspace.slack.com/archives/C02GBGG90CU/p1644882062984109?thread_ts=1644881402.488669&cid=C02GBGG90CU) is being treated by DataHub as an error, and it stops processing workbooks when it hits it. In a project with 18 workbooks, it gets through 2, hits the error on the third, and only ingests those 2 into DataHub. So I'd highly suggest refactoring the GraphQL query into smaller pieces - I realize it's ugly and adds complexity, but I'm not sure there's any other realistic way around it.
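Short of a full refactor, another option might be to special-case that error code so partial results still get processed. A sketch, with the error shape taken from the report above (`report` stands in for the source's report object):
```python
# Sketch: treat NODE_LIMIT_EXCEEDED as the warning Tableau says it is, and
# only abort on genuinely fatal GraphQL errors, so the remaining workbooks
# in the project still get ingested.
def split_fatal_errors(errors, report):
    fatal = []
    for error in errors:
        extensions = error.get("extensions") or {}
        if extensions.get("code") == "NODE_LIMIT_EXCEEDED":
            # Partial results for this page; log it and keep going.
            report.report_warning("tableau-metadata", error["message"])
        else:
            fatal.append(error)
    return fatal
```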
Second, I think I've found the cause of the "IndexError: list index out of range" error reported earlier (https://datahubspace.slack.com/archives/C02GBGG90CU/p1644507323708139). I hit it today with a workbook that has both hidden worksheets (i.e. sheets in the workbook that are published but not exposed through the web interface) and viz-in-tooltip sheets (i.e. sheets in the workbook only accessible by hovering over marks in another sheet). In both of these cases the "path" element in the sheet structure is empty and the "containedInDashboards" list is empty. The code at line 647 of `metadata-ingestion/src/datahub/ingestion/source/tableau.py` assumes that AT LEAST one of these elements will have a value, and doesn't deal with the scenario where both are blank:
```python
if sheet.get("path", ""):
    sheet_external_url = f"{self.config.connect_uri}#/site/{self.config.site}/views/{sheet.get('path', '')}"
else:
    # sheet contained in dashboard
    dashboard_path = sheet.get("containedInDashboards")[0].get("path", "")
    sheet_external_url = f"{self.config.connect_uri}/t/{self.config.site}/authoring/{dashboard_path}/{sheet.get('name', '')}"
```
In this case, the else should be replaced with a check that the length of `sheet.get("containedInDashboards")` is greater than 0, and a new else created that deals with the situation where both are blank.
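Something along these lines, sketched against the snippet above (the both-blank fallback of `None` is my guess at reasonable behavior; skipping the field entirely might be better):
```python
# Sketch of the fix: handle hidden and viz-in-tooltip sheets, where both
# "path" and "containedInDashboards" are empty.
dashboards = sheet.get("containedInDashboards") or []
if sheet.get("path", ""):
    sheet_external_url = (
        f"{self.config.connect_uri}#/site/{self.config.site}"
        f"/views/{sheet.get('path', '')}"
    )
elif len(dashboards) > 0:
    # sheet contained in dashboard
    dashboard_path = dashboards[0].get("path", "")
    sheet_external_url = (
        f"{self.config.connect_uri}/t/{self.config.site}"
        f"/authoring/{dashboard_path}/{sheet.get('name', '')}"
    )
else:
    # hidden or viz-in-tooltip sheet: no usable path, leave the URL unset
    sheet_external_url = None
```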
Thank you for this amazing feedback, @early-article-88153!! I really appreciate the time & detail you’ve put into this 🙂