# ingestion
c
Hi. When we try to ingest a Snowflake database, the ingestion is failing.
ProgrammingError: (snowflake.connector.errors.ProgrammingError) 090105 (22000): Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name.
[SQL:
select table_catalog, table_schema, table_name
from information_schema.tables
where last_altered >= to_timestamp_ltz(1663086530849, 3) and table_type= 'BASE TABLE'
]
(Background on this error at: http://sqlalche.me/e/13/f405)
[2022-09-14 16:28:52,024] INFO     {datahub.entrypoints:187} - DataHub CLI version: 0.8.41 at /tmp/datahub/ingest/venv-snowflake-0.8.41/lib/python3.9/site-packages/datahub/__init__.py
[2022-09-14 16:28:52,024] INFO     {datahub.entrypoints:190} - Python version: 3.9.9 (main, Dec 21 2021, 10:03:34) [GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-snowflake-0.8.41/bin/python3 on Linux-5.4.196-108.356.amzn2.x86_64-x86_64-with-glibc2.31
[2022-09-14 16:28:52,024] INFO     {datahub.entrypoints:193} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.42', 'commit': '4f35a6c43dcd058e4e85b1ed7e4818100ab224e0'}}, 'managedIngestion': {'defaultCliVersion': '0.8.41', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'prod'}, 'noCode': 'true'}
2022-09-14 16:28:53.137401 [exec_id=2dc5382a-f673-489f-b9bf-4cf1328b7bf7] INFO: Failed to execute 'datahub ingest'
2022-09-14 16:28:53.137719 [exec_id=2dc5382a-f673-489f-b9bf-4cf1328b7bf7] INFO: Caught exception EXECUTING task_id=2dc5382a-f673-489f-b9bf-4cf1328b7bf7, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task
    self.event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 112, in execute
    raise TaskError("Failed to execute 'datahub ingest'")
acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
Execution finished with errors.
Here is the recipe:
source:
  type: snowflake
  config:
    username: xxxxx
    password: xxxxx
    role: xxxx
    warehouse: xxxxx
    check_role_grants: true
    account_id: xxxxx
    include_table_lineage: true
    include_view_lineage: true
    ignore_start_time_lineage: true
    upstream_lineage_in_report: true
    profiling:
      enabled: true
    stateful_ingestion:
      enabled: true
    database_pattern:
      allow:
        - SNOWFLAKE
    schema_pattern:
      allow:
        - ACCOUNT_USAGE
pipeline_name: 'urn:li:dataHubIngestionSource:xxxxxxxxxxxxxxxxxxxxxxxxxxx'
g
It looks like you’re on version 0.8.41 or 0.8.42, which is a few months old. Could you try updating? If you’re using the UI-based ingestion, you can set the version to 0.8.44.2 in the advanced panel of step 4.
c
Hi Harshal, thanks for getting back on this. I am running ingestion using the UI; I reran it with the version set to 0.8.44.2 in step 4, and it still failed.
sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 090105 (22000): Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name.
[SQL:
select table_catalog, table_schema, table_name
from information_schema.tables
where last_altered >= to_timestamp_ltz(1663105153351, 3) and table_type= 'BASE TABLE'
]
(Background on this error at: http://sqlalche.me/e/13/f405)
[2022-09-14 21:39:14,772] ERROR    {datahub.entrypoints:195} - Command failed:
    (snowflake.connector.errors.ProgrammingError) 090105 (22000): Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name.
[SQL:
select table_catalog, table_schema, table_name
from information_schema.tables
where last_altered >= to_timestamp_ltz(1663105153351, 3) and table_type= 'BASE TABLE'
]
(Background on this error at: http://sqlalche.me/e/13/f405) due to
        '090105 (22000): Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name.'.
    Run with --debug to get full stacktrace.
    e.g. 'datahub --debug ingest run -c /tmp/datahub/ingest/bb16b21b-b079-4da0-9d73-2869c7d71535/recipe.yml --report-to /tmp/datahub/ingest/bb16b21b-b079-4da0-9d73-2869c7d71535/ingestion_report.json'
2022-09-14 21:39:15.479957 [exec_id=bb16b21b-b079-4da0-9d73-2869c7d71535] INFO: Failed to execute 'datahub ingest'
2022-09-14 21:39:15.480285 [exec_id=bb16b21b-b079-4da0-9d73-2869c7d71535] INFO: Caught exception EXECUTING task_id=bb16b21b-b079-4da0-9d73-2869c7d71535, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task
    self.event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 112, in execute
    raise TaskError("Failed to execute 'datahub ingest'")
acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
I checked in Snowflake; I can see the queries below failed.

SHOW /* sqlalchemy:get_table_names */ TABLES IN account_usage
-- Error: SQL compilation error: Object does not exist, or operation cannot be performed.

SHOW /* sqlalchemy:get_view_names */ VIEWS IN account_usage
-- Error: SQL compilation error: Object does not exist, or operation cannot be performed.

select table_catalog, table_schema, table_name
from information_schema.tables
where last_altered >= to_timestamp_ltz(1663105153351, 3) and table_type= 'BASE TABLE'
-- Error: Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name.
g
This definitely looks like a bug cc @hundreds-photographer-13496. Out of curiosity, could you try running with the `snowflake-beta` source and see if that one works?
You might need to hit the “yaml” button in the recipe config dialog and edit it manually
c
Which parameter exactly do you want me to update?
g
Set the source’s `type: snowflake-beta`. Let me know if you need help with it.
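For reference, only the `type` line in the recipe above changes; everything else can stay as-is:

source:
  type: snowflake-beta   # changed from: snowflake
  config:
    # ... same config options as in your current recipe ...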
c
I updated and it failed in less than 1 minute.
g
Was the error the same?
c
pydantic.error_wrappers.ValidationError: 1 validation error for SnowflakeV2Config
__root__
  Check role grants is not supported. Set `check_role_grants` to False. (type=value_error)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-snowflake-beta-0.8.44.2/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 197, in run
    pipeline = Pipeline.create(
  File "/tmp/datahub/ingest/venv-snowflake-beta-0.8.44.2/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 317, in create
    return cls(
  File "/tmp/datahub/ingest/venv-snowflake-beta-0.8.44.2/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 202, in __init__
    self._record_initialization_failure(
  File "/tmp/datahub/ingest/venv-snowflake-beta-0.8.44.2/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 129, in _record_initialization_failure
    raise PipelineInitError(msg) from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (snowflake-beta)
[2022-09-14 21:52:36,879] ERROR    {datahub.entrypoints:195} - Command failed:
    Failed to configure source (snowflake-beta) due to
        '1 validation error for SnowflakeV2Config
__root__
  Check role grants is not supported. Set `check_role_grants` to False. (type=value_error)'.
    Run with --debug to get full stacktrace.
    e.g. 'datahub --debug ingest run -c /tmp/datahub/ingest/47750afb-ae23-42c1-9dbd-4d9fa5017151/recipe.yml --report-to /tmp/datahub/ingest/47750afb-ae23-42c1-9dbd-4d9fa5017151/ingestion_report.json'
2022-09-14 21:52:37.215010 [exec_id=47750afb-ae23-42c1-9dbd-4d9fa5017151] INFO: Failed to execute 'datahub ingest'
2022-09-14 21:52:37.215353 [exec_id=47750afb-ae23-42c1-9dbd-4d9fa5017151] INFO: Caught exception EXECUTING task_id=47750afb-ae23-42c1-9dbd-4d9fa5017151, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task
    self.event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 112, in execute
    raise TaskError("Failed to execute 'datahub ingest'")
acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
Execution finished with errors.
g
Do you think it’d be helpful to hop on a call to debug this?
c
Check role grants is not supported.
Looks like I need to remove this
g
Looks like the snowflake and snowflake-beta configs are slightly incompatible
yep
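Concretely, the config would become something like this (a sketch based on the recipe above; everything not shown stays the same):

source:
  type: snowflake-beta
  config:
    username: xxxxx
    password: xxxxx
    role: xxxx
    warehouse: xxxxx
    account_id: xxxxx
    # check_role_grants removed: SnowflakeV2Config rejects it
    include_table_lineage: true
    # ... remaining options unchanged ...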
c
Removed check_role_grants; it is running now. I will update you on the status.
By the way, what is the difference between snowflake and snowflake-beta?
Looks like it has moved forward; I will update when it completes.
Meanwhile I have one more question. We have integrated Looker, Airflow, and Snowflake, so when I look at the lineage for a table from Snowflake, it should show me the entire lineage across the tools, e.g. an xyz table in Snowflake being utilized in Airflow or Looker. But for now, it shows me lineage within Snowflake only. What needs to be done in order to enable lineage across the different tools?
@gray-shoe-75895
Ingestion just completed successfully
g
Glad snowflake-beta is working. Overall snowflake-beta is a complete rewrite of the snowflake source, so it (1) is much faster and (2) automatically captures usage information, whereas the original snowflake required an accompanying snowflake-usage pipeline
In the next few days/weeks, we’re going to move snowflake-beta -> snowflake and the original snowflake -> snowflake-legacy, so it’s good to transition
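For context, that accompanying pipeline was just a second recipe against the same account. A sketch, assuming the usual connection options carry over to the snowflake-usage source:

source:
  type: snowflake-usage
  config:
    username: xxxxx
    password: xxxxx
    role: xxxx
    warehouse: xxxxx
    account_id: xxxxx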
For looker, you need to run both the looker and lookml sources. The lookml source should automatically generate lineage to your snowflake instance
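A minimal sketch of the two Looker recipes, assuming the standard looker/lookml config fields (the names here - base_url, base_folder, connection_to_platform_map - are from memory, so double-check the source docs for your CLI version); connection_to_platform_map is the piece that ties LookML connections back to Snowflake and produces the cross-tool lineage:

source:
  type: looker
  config:
    base_url: https://yourcompany.looker.com   # hypothetical Looker host
    client_id: xxxxx
    client_secret: xxxxx

source:
  type: lookml
  config:
    base_folder: /path/to/your/lookml/repo     # hypothetical checkout path
    connection_to_platform_map:
      your_snowflake_connection:               # hypothetical connection name
        platform: snowflake
        default_db: YOUR_DB

(These run as two separate recipes.)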
For airflow, lineage is a bit more complex - unfortunately the best way to do it right now is by manually annotating your dag tasks with “inlets” and “outlets” (like this example)
c
@gray-shoe-75895
So can I run ingestion using snowflake-beta until it goes live?
For Airflow, updating each DAG is too much: our DAGs are very complex, adding inlets/outlets to each of them is too cumbersome, and we have many DAGs. Let me know if there is another solution that doesn't involve updating the DAGs.
@gray-shoe-75895 Now I am getting the following error message with snowflake-beta when trying to ingest our database.
"warnings": [], "failures": [{"error": "Unable to emit '
                      'metadata to DataHub GMS", "info": {"exceptionClass": "com.linkedin.restli.server.RestLiServiceException", "stackTrace": '
                      '"com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class '
                      'com.linkedin.dataset.DatasetProfile: ERROR :: /sizeInBytes :: unrecognized field found but not allowed\\n\\n\\tat '
                      'com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:142)", "message": "Failed '
                      'to validate record with class com.linkedin.dataset.DatasetProfile: ERROR :: /sizeInBytes :: unrecognized field found but not '
                      'allowed", "status": 422, "id": '
And I tried with `source: snowflake`; it gives me the below error message.
, {"error": "Unable '
                      'to emit metadata to DataHub GMS", "info": {"exceptionClass": "com.linkedin.restli.server.RestLiServiceException", '
                      '"stackTrace": "com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: INTERNAL SERVER ERROR\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:210)\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)", "message": "INTERNAL SERVER '
                      'ERROR", "status": 500, "id": '
                      '"urn:li:dataset:(urn:li:dataPlatform:snowflake,max_dev.workspace.MAX_DEV\\",\\"WORKSPACE\\".temp,PROD)"}}], "start_time": '
                      '"2022-09-14 23:02:25.268523 (50 minutes and 45.29 seconds ago).", "current_time": "2022-09-14 23:53:10.555770 (now).", '
                      '"total_duration_in_seconds": "3045.29", "gms_version": "v0.8.42", "pending_requests": "0"}}}'}
Execution finished with errors.
h
Hey @cool-boots-36947, for the snowflake-beta error, what version of GMS are you on? This connector supports ingesting table storage size when profiling is enabled, but that needs GMS version 0.8.44.
If you don't plan to upgrade the GMS version, for now you can pin the datahub CLI version to 0.8.44 for Snowflake managed ingestion - https://datahubproject.io/docs/ui-ingestion/#advanced-running-with-a-specific-cli-version
c
@hundreds-photographer-13496
We have the below:
DataHub CLI version: 0.8.41
Python version: 3.9.9
GMS: v0.8.42
I am using 0.8.44.2 in the advanced option from UI ingestion for the snowflake-beta and snowflake sources.
h
Ok, then you can use 0.8.44 instead of 0.8.44.2 from the advanced option for snowflake-beta in order to make it work without error, as you don't have the latest GMS version.
c
One more question: how do we avoid temp tables popping up? And how do we remove temp tables from DataHub?
@gray-shoe-75895 How do we avoid temp tables popping up? And how do we remove temp tables from DataHub?
@gray-shoe-75895 I updated the version to 0.8.44 in the advanced option in the UI and ran Snowflake ingestion with the snowflake-beta source; it again failed with the below error.
"profile_candidates": {}, "start_time": "2022-09-15 17:07:23.832015", "running_time_in_seconds": "2082", '
                      '"include_usage_stats": "True", "include_operational_stats": "True", "include_technical_schema": "True", '
                      '"table_lineage_query_secs": "321.60150102712214", "view_upstream_lineage_query_secs": "26.883752648020163", '
                      '"view_downstream_lineage_query_secs": "255.40169666311704", "external_lineage_queries_secs": "235.59301565494388"}}, "sink": '
                      '{"type": "datahub-rest", "report": {"total_records_written": "64948", "records_written_per_second": "31", "warnings": [], '
                      '"failures": [{"error": "Unable to emit metadata to DataHub GMS", "info": {"exceptionClass": '
                      '"com.linkedin.restli.server.RestLiServiceException", "stackTrace": "com.linkedin.restli.server.RestLiServiceException [HTTP '
                      'Status:500]: INTERNAL SERVER ERROR\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:210)\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)", "message": "INTERNAL SERVER '
                      'ERROR", "status": 500}}, {"error": "Unable to emit metadata to DataHub GMS", "info": {"exceptionClass": '
                      '"com.linkedin.restli.server.RestLiServiceException", "stackTrace": "com.linkedin.restli.server.RestLiServiceException [HTTP '
                      'Status:500]: INTERNAL SERVER ERROR\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:210)\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)", "message": "INTERNAL SERVER '
                      'ERROR", "status": 500}}, {"error": "Unable to emit metadata to DataHub GMS", "info": {"exceptionClass": '
                      '"com.linkedin.restli.server.RestLiServiceException", "stackTrace": "com.linkedin.restli.server.RestLiServiceException [HTTP '
                      'Status:500]: INTERNAL SERVER ERROR\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:210)\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)", "message": "INTERNAL SERVER '
                      'ERROR", "status": 500}}, {"error": "Unable to emit metadata to DataHub GMS", "info": {"exceptionClass": '
                      '"com.linkedin.restli.server.RestLiServiceException", "stackTrace": "com.linkedin.restli.server.RestLiServiceException [HTTP '
                      'Status:500]: INTERNAL SERVER ERROR\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:210)\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)", "message": "INTERNAL SERVER '
                      'ERROR", "status": 500}}, {"error": "Unable to emit metadata to DataHub GMS", "info": {"exceptionClass": '
                      '"com.linkedin.restli.server.RestLiServiceException", "stackTrace": "com.linkedin.restli.server.RestLiServiceException [HTTP '
                      'Status:500]: INTERNAL SERVER ERROR\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:210)\\n\\tat '
                      'com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)", "message": "INTERNAL SERVER '
                      'ERROR", "status": 500}}], "start_time": "2022-09-15 17:07:21.342508", "current_time": "2022-09-15 17:42:06.802729", '
                      '"total_duration_in_seconds": "2085.46", "gms_version": "v0.8.42", "pending_requests": "0"}}}'}
Execution finished with errors.
@gray-shoe-75895 Any update?
g
You can use the allow/deny patterns to filter out temp tables. Out of curiosity, what are the temp tables usually named like?
In the advanced section, you probably want 0.8.44.3, not 0.8.44
c
We have many tables starting with _xxxxxxx. What pattern should I enter?
g
You can simply put that into the table deny pattern field
c
E.g.:
table_pattern:
  deny:
    - "._*"
Is this right?
g
I think that should work, but it depends on the exact table naming convention
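For reference, here's how it would sit in the recipe. One caveat, based on how the SQL-source patterns generally behave (worth verifying for your version): the regex is matched against the fully qualified db.schema.table name, so a rule targeting table names that begin with an underscore likely needs to look more like this:

source:
  type: snowflake-beta
  config:
    # ... other options unchanged ...
    table_pattern:
      deny:
        # denies any fully qualified name containing "._", i.e. a schema or
        # table whose own name starts with an underscore; tighten as needed
        - ".*\\._.*"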
c
The naming convention is just that they start with "_"; there is no other pattern.
And another thing: can we remove the ones that were already ingested from DataHub?
One more question: do we have column-level lineage in DataHub?
g
the ones that were previously ingested will be removed automatically when ingestion is run again
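That removal relies on stateful ingestion, which the recipe above already enables. The relevant bit, as a sketch (remove_stale_metadata defaulting to true is my reading of the stateful ingestion docs, so verify for your version):

source:
  type: snowflake-beta
  config:
    # ... other options unchanged ...
    stateful_ingestion:
      enabled: true
      # soft-deletes entities that disappear between successive runs
      remove_stale_metadata: true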
our data model supports column-level lineage, but support for the UI + snowflake column lineage is coming soon