early-article-88153
02/14/2022, 11:35 PM
Source (tableau) report:
{'workunits_produced': 0,
 'workunit_ids': [],
 'warnings': {},
 'failures': {'tableau-metadata': [
     "Unable to retrieve metadata from tableau. Information: Connection: workbooksConnection Error: [
        {'message': 'Showing partial results. The request exceeded the 20000 node limit. Use pagination, additional filtering, or both in the query to adjust results.',
         'extensions': {'severity': 'WARNING', 'code': 'NODE_LIMIT_EXCEEDED', 'properties': {'nodeLimit': 20000}}},
        {'message': "Cannot return null for non-nullable type: 'RemoteType' within parent 'Column' (/workbooksConnection/nodes[8]/embeddedDatasources[0]/upstreamTables[0]/columns[30]/remoteType)",
         'path': ['workbooksConnection', 'nodes', 8, 'embeddedDatasources', 0, 'upstreamTables', 0, 'columns', 30, 'remoteType'],
         'locations': None, 'errorType': 'DataFetchingException', 'extensions': None},
        {'message': "Cannot return null for non-nullable type: 'RemoteType' within parent 'Column' (/workbooksConnection/nodes[8]/embeddedDatasources[0]/upstreamTables[0]/columns[31]/remoteType)",
         'path': ['workbooksConnection', 'nodes', 8, 'embeddedDatasources', 0, 'upstreamTables', 0, 'columns', 31, 'remoteType'],
         'locations': None, 'errorType': 'DataFetchingException', 'extensions': None},
        ...]"]}
}
...and the remaining entries repeat the last error message (about RemoteType) ad nauseam.
early-article-88153
02/14/2022, 11:38 PM
The RemoteType error is spurious; what was actually happening here was that requesting information on 10 workbooks at a time caused Tableau's API to return more data than was allowed. I had to move to requesting 1 workbook at a time, which resolved the errors. In particular, on line 807 of metadata-ingestion/src/datahub/ingestion/source/tableau.py
I changed:
yield from self.emit_workbooks(10)
to:
yield from self.emit_workbooks(1)
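For context, the underlying constraint is the Metadata API's 20,000-node cap per query, so the batch size just needs to be small enough that each response stays under it. A more general (purely hypothetical) pattern is to shrink the page size on failure rather than hard-coding 1; nothing below is DataHub code, and `fetch_page` stands in for the Tableau metadata query:

```python
# Sketch of adaptive page sizing for a paged API that enforces a node limit,
# like Tableau's 20,000-node cap. NodeLimitExceeded and fetch_page are
# hypothetical stand-ins, not DataHub's actual implementation.
class NodeLimitExceeded(Exception):
    pass

def fetch_all(fetch_page, page_size=10):
    """Fetch every item, halving the page size whenever a page is too big."""
    results, offset = [], 0
    while True:
        try:
            page = fetch_page(page_size, offset)
        except NodeLimitExceeded:
            if page_size == 1:
                raise  # even a single item exceeds the limit; give up
            page_size = max(1, page_size // 2)  # retry with a smaller page
            continue
        if not page:
            return results
        results.extend(page)
        offset += len(page)

# Simulated API that rejects any page of more than 2 items:
def fake_fetch(page_size, offset, _items=list(range(7))):
    if page_size > 2:
        raise NodeLimitExceeded()
    return _items[offset:offset + page_size]
```

With this shape the caller still gets every item even when the initial page size is too large, at the cost of a few rejected requests while it ramps down.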
This fixed the errors I was seeing and allowed ingestion to proceed.
early-article-88153
02/14/2022, 11:41 PM
Source (tableau) report:
{'workunits_produced': 0,
 'workunit_ids': [],
 'warnings': {},
 'failures': {'tableau-metadata': [
     "Unable to retrieve metadata from tableau. Information: Connection: workbooksConnection Error: [
        {'message': 'Showing partial results. The request exceeded the 20000 node limit. Use pagination, additional filtering, or both in the query to adjust results.',
         'extensions': {'severity': 'WARNING', 'code': 'NODE_LIMIT_EXCEEDED', 'properties': {'nodeLimit': 20000}}}]"]}
}
even when requesting a single workbook at a time. Since this is only a warning it wasn't terribly problematic, but it means I may be missing information for that particular workbook. It might make sense to split the workbooks query into smaller queries, if possible.
early-article-88153
02/14/2022, 11:49 PM
The Oracle ingestion source produces URNs for a table like:
urn:li:dataset:(urn:li:dataPlatform:oracle,schema.TableName,PROD)
When Tableau constructs a URN from the data source pointing at the same table, it builds it as:
urn:li:dataset:(urn:li:dataPlatform:oracle,hostname:1521.SCHEMA.TableName,PROD)
The two particular issues here are the inclusion of the hostname (upstream_db in the Tableau files) and the schema being upper-case. (Because everything is upper-case in Oracle by default, I assume the lower-case schema name comes from SQLAlchemy, but I haven't verified that.) To get Tableau's URNs to match what I was getting from Oracle, I had to make an adjustment in metadata-ingestion/src/datahub/ingestion/source/tableau_common.py
around line 400:
database_name = f"{upstream_db}." if upstream_db else ""
schema_name = f"{schema}." if schema else ""
urn = builder.make_dataset_urn(
    platform, f"{database_name}{schema_name}{final_name}", env
)
to:
database_name = f"{upstream_db}." if upstream_db else ""
schema_name = f"{schema.lower()}." if schema else ""
urn = builder.make_dataset_urn(
    platform, f"{schema_name}{final_name}", env
)
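The net effect (drop the upstream_db prefix, lower-case the schema) can be checked with a small sketch. Here make_dataset_urn only mimics the string that DataHub's builder produces, and tableau_table_urn is a hypothetical condensation of the changed lines, not the project's actual function:

```python
# Sketch of the URN normalization above: ignore upstream_db and lower-case
# the schema so the Tableau-side URN matches the Oracle source's.
# make_dataset_urn mimics the builder's output format for illustration only.
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def tableau_table_urn(platform, upstream_db, schema, table, env="PROD"):
    # upstream_db is intentionally ignored after the change above
    schema_name = f"{schema.lower()}." if schema else ""
    return make_dataset_urn(platform, f"{schema_name}{table}", env)
```

With this, tableau_table_urn("oracle", "hostname:1521", "SCHEMA", "TableName") yields urn:li:dataset:(urn:li:dataPlatform:oracle,schema.TableName,PROD), matching the Oracle source.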
early-article-88153
02/16/2022, 7:38 PM
metadata-ingestion/src/datahub/ingestion/source/tableau.py assumes that AT LEAST one of these elements (the sheet's path or its containedInDashboards list) will have a value, and doesn't handle the scenario where both are blank:
if sheet.get("path", ""):
    sheet_external_url = f"{self.config.connect_uri}#/site/{self.config.site}/views/{sheet.get('path', '')}"
else:
    # sheet contained in dashboard
    dashboard_path = sheet.get("containedInDashboards")[0].get("path", "")
    sheet_external_url = f"{self.config.connect_uri}/t/{self.config.site}/authoring/{dashboard_path}/{sheet.get('name', '')}"
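One defensive way to cover the both-blank case is sketched below. This is illustration only: connect_uri and site stand in for self.config.*, and returning None when no link can be built is an assumption, not DataHub's actual behavior:

```python
# Sketch: build a sheet's external URL, guarding against both `path` and
# `containedInDashboards` being blank (the latter may also be None).
# Returning None for the no-link case is an assumption for illustration.
def sheet_external_url(sheet: dict, connect_uri: str, site: str):
    if sheet.get("path", ""):
        return f"{connect_uri}#/site/{site}/views/{sheet.get('path', '')}"
    dashboards = sheet.get("containedInDashboards") or []  # tolerate None
    if len(dashboards) > 0:
        # sheet contained in dashboard
        dashboard_path = dashboards[0].get("path", "")
        return f"{connect_uri}/t/{site}/authoring/{dashboard_path}/{sheet.get('name', '')}"
    return None  # both blank: emit no external URL instead of crashing
```

The `or []` also covers the case where containedInDashboards is present but None, which would otherwise raise a TypeError on the `[0]` subscript.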
In this case, the else should be replaced with a check that sheet.get("containedInDashboards") is non-empty, and a new else added to handle the situation where both are blank.little-megabyte-1074