clean-tomato-22549
09/26/2022, 9:46 AM
If you plan to enable extraction of table lineage (via the include_table_lineage config flag) or extraction of usage statistics (via the include_usage_stats config flag), you'll also need to grant access to the Account Usage system tables, from which the DataHub source extracts this information. This can be done by granting access to the snowflake database.
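The grant described above uses Snowflake's imported-privileges mechanism for shared databases; a minimal sketch, assuming your DataHub connection uses a role named datahub_role (check the DataHub Snowflake source docs for the exact statements your version expects):

```sql
-- datahub_role is a placeholder; substitute the role your recipe authenticates with.
grant imported privileges on database snowflake to role datahub_role;
```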
fancy-alligator-33404
09/26/2022, 10:54 AM
few-carpenter-93837
09/26/2022, 7:02 AM
careful-action-61962
09/26/2022, 11:14 AM
fierce-baker-1392
09/26/2022, 12:21 PM
lemon-engine-23512
09/26/2022, 1:11 PM
proud-table-38689
09/26/2022, 3:58 PMssl_args = {'ssl_ca': ca_path}
engine = create_engine("mysql+mysqlconnector://<user>:<pass>@<addr>/<schema>",
connect_args=ssl_args)
proud-table-38689
09/26/2022, 3:58 PM
connect_args=ssl_args one
creamy-tent-10151
09/26/2022, 5:18 PM
bland-balloon-48379
09/26/2022, 7:14 PM
clean-tomato-22549
09/27/2022, 3:06 AM
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (presto-on-hive)
[2022-09-26 05:58:34,312] ERROR {datahub.entrypoints:195} - Command failed:
Failed to configure source (presto-on-hive) due to
'TSocket read 0 bytes'.
Run with --debug to get full stacktrace.
melodic-beach-18239
09/27/2022, 8:00 AM
microscopic-mechanic-13766
09/27/2022, 10:29 AM
/datahub/metadata-ingestion/src/datahub/ingestion/source
inside the project?
bumpy-journalist-41369
09/27/2022, 11:15 AM" 'warnings': {'<s3://aws-glue-scripts-063693278873-us-east-1/NilayDev/CompressionS3.py>': ['Error parsing DAG for Glue job. The script '\n"
' '
"'<s3://aws-glue-scripts-063693278873-us-east-1/NilayDev/CompressionS3.py> '\n"
" 'cannot be processed by Glue (this usually "
"occurs when it '\n"
" 'has been user-modified): An error occurred '\n"
" '(InvalidInputException) when calling the "
"GetDataflowGraph '\n"
" 'operation: line 19:4 no viable alternative at "
"input '\n"
' "\'e3g:))\'e:)o.)) #\'"],\n'
" '<s3://cdc-analytics-dev-us-east-1-alert-classification-glue/ttp_window_features.py>': ['Error parsing DAG for Glue job. The "
"script '\n"
' '
"'<s3://cdc-analytics-dev-us-east-1-alert-classification-glue/ttp_window_features.py> '\n"
" 'cannot be processed by Glue (this "
"usually '\n"
" 'occurs when it has been "
"user-modified): An '\n"
" 'error occurred "
"(InvalidInputException) when '\n"
" 'calling the GetDataflowGraph "
"operation: line '\n"
" '337:12 no viable alternative at "
"input '\n"
'
as well as this :
exception=NoSuchKey('An error occurred (NoSuchKey) wh\n"
" en calling the GetObject operation: The specified key does not exist.')>\n"
The recipe I am using is:
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
source:
  type: glue
  config:
    aws_region: us-east-1
    env: DEV
    database_pattern:
      allow:
        - cdca
Does anyone know how to fix the problem?
future-hair-23690
09/27/2022, 2:40 PM
kind-dawn-17532
09/27/2022, 11:25 AM
fast-potato-13714
09/27/2022, 3:20 PM
green-lion-58215
09/27/2022, 3:43 PM
wonderful-notebook-20086
09/27/2022, 7:18 PM
getting-started
docker container images based on the Quickstart guide
I tried setting up a connection to our Redshift cluster and ran into this error:
2022-09-27 18:30:45.362431 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Starting execution for task with name=RUN_INGEST
2022-09-27 18:47:03.670827 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Caught exception EXECUTING task_id=fec3ab48-c33b-4403-abfc-f61720c609ae, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 182, in execute
    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 126, in _read_output_lines
    full_log_file.write(line)
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task
    task_event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 188, in execute
    full_log_file.close()
OSError: [Errno 28] No space left on device
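Errno 28 means the filesystem the executor writes its ingestion log to is full, so the fix is on the container/volume side rather than in the recipe. A small pre-flight sketch (the path and the 1 GiB threshold are assumptions for illustration, not anything DataHub ships):

```python
import shutil

def has_free_space(path: str, required_bytes: int) -> bool:
    """True if the filesystem containing `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes

# Assumed log location and headroom requirement; adjust to where the
# executor actually writes its full_log_file in your deployment.
if not has_free_space("/tmp", 1 * 1024**3):
    print("Low disk space: the ingestion run's log writes are likely to fail.")
```

Profiling runs tend to produce much larger logs than plain metadata extraction, which is one reason this surfaces only on some recipes.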
Ingestion recipe yml looks something like this:
source:
  type: redshift
  config:
    start_time: '2022-09-26 00:00:00Z'
    end_time: '2022-09-26 12:00:00Z'
    table_lineage_mode: mixed
    include_table_lineage: true
    database: insightsetl
    password: '${etl2_test_datahub_creds}'
    profiling:
      enabled: true
    host_port: 'pi-redshift-etl-2-test.ccvpgkqogsrc.us-east-1.redshift.amazonaws.com:8192'
    stateful_ingestion:
      enabled: true
    username: datahub_ingestion
pipeline_name: 'urn:li:dataHubIngestionSource:93b5640d-8ed3-456e-89f9-0ec3def38733'
I'm not sure if it's a container issue or config or something else.
aloof-leather-92383
09/27/2022, 9:14 PM
proud-table-38689
09/28/2022, 1:45 AM
teradatasqlalchemy. I added it to the Docker image, but I don't see it used in the venv that's used per ingestion.
few-sugar-84064
09/28/2022, 2:36 AM
source:
  type: glue
  config:
    aws_region: "ap-northeast-2"
    extract_transforms: True
    catalog_id: "catalog_id"
sink:
  type: "datahub-rest"
  config:
    server: "gms server address"
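Returning to the earlier teradatasqlalchemy question: the message says the executor uses a venv per ingestion, so a package baked into the base image may not be visible there. A quick diagnostic sketch (my own illustration, not DataHub tooling) to confirm which interpreter is running and whether the dialect is importable from it:

```python
import importlib.util
import sys

def plugin_available(module_name: str) -> bool:
    """True if `module_name` is importable in the current interpreter/venv."""
    return importlib.util.find_spec(module_name) is not None

print(sys.executable)                          # which interpreter/venv is actually in use
print(plugin_available("teradatasqlalchemy"))  # is the dialect visible from it?
```

Running this from inside the per-ingestion venv (rather than the container's base Python) shows whether the package needs to be installed into that venv specifically.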
chilly-ability-77706
09/28/2022, 5:33 AM
chilly-ability-77706
09/28/2022, 5:33 AM
source:
  type: hive
  config:
    database: <>
    password: <>
    host_port: <>
    stateful_ingestion:
      enabled: true
    username: <>
    options:
      connect_args:
        auth: NOSASL
        http_path: "/gateway/default/hive"
        ssl_cert: "required"
    scheme: hive+https
gifted-diamond-19544
09/28/2022, 8:40 AM
thankful-ghost-61888
09/28/2022, 9:02 AM
${EXTENDED} keyword (docs here). From the ingestion logs:
2022-09-25 10:05:57,890 ERROR SQL lineage analyzer error 'An Identifier is expected, got Token[value: EXTENDED] instead.' for query: 'SELECT
  date_trunc('week',purchase_date) as purchase_date,
  user_id,
  buyer_country,
  count(id) as items_bought_week,
  sum(GMV) as gmv_bought_week,
  min(user_order_sequence_number) as user_order_sequence_number_minweek,
  max(user_order_sequence_number) as user_order_sequence_number_maxweek,
  percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,
  percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country
  FROM (EXTENDED)
  GROUP BY 1,2,3
2022-09-25 10:05:57,890 ERROR sql holder not present so cannot get tables
2022-09-25 10:05:57,890 ERROR sql holder not present so cannot get columns
The original LookML sql looks like this:
include: "dt_engine_purchasetransaction.view"
view: dt_buyer_transactions_weekly {
  extends: [dt_engine_purchasetransaction]
  derived_table: {
    sql:
      SELECT
        date_trunc('week',purchase_date) as purchase_date,
        user_id,
        buyer_country,
        count(id) as items_bought_week,
        sum(GMV) as gmv_bought_week,
        min(user_order_sequence_number) as user_order_sequence_number_minweek,
        max(user_order_sequence_number) as user_order_sequence_number_maxweek,
        percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,
        percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country
      FROM (${EXTENDED})
      GROUP BY 1,2,3
    ;;
}
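The analyzer is choking on the LookML substitution operator itself. One stopgap, a hypothetical pre-processing sketch rather than anything DataHub provides, is to replace `${...}` references with a plain identifier before handing the SQL to the lineage parser:

```python
import re

# Matches LookML substitution operators such as ${EXTENDED} or ${TABLE}.user_id.
LOOKML_REF = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")

def strip_lookml_refs(sql: str, placeholder: str = "lookml_ref") -> str:
    """Replace ${...} references so a generic SQL parser sees a plain identifier."""
    return LOOKML_REF.sub(placeholder, sql)

print(strip_lookml_refs("SELECT user_id FROM (${EXTENDED}) GROUP BY 1"))
```

Lineage extracted this way points at the placeholder rather than at the extended view, so it only keeps the rest of the query parseable; it does not recover the real upstream table.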
We're also seeing an issue with parsing a new line after the FROM statement:
2022-09-25 10:06:10,069 ERROR SQL lineage analyzer error 'An Identifier is expected, got Token[value:
] instead.' for query: 'SELECT
  __d_a_t_e
  , userid
  , query
  , sum(total_searches) AS total_searches
  FROM
  (
  (SELECT ...
The original syntax looks like this:
view: product_interaction_searches {
  derived_table: {
    sql_trigger_value: SELECT max(event_date::date) FROM datalake_processed.etl_tracking_searches;;
    distribution: "date"
    sortkeys: ["date"]
    sql:
      SELECT
        date
        , userid
        , query
        , sum(total_searches) AS total_searches
      FROM
      (
        (SELECT
          cast(date as date) AS date
          , userid AS userid
          , query
          , count(*) AS total_searches
        FROM datalake_compacted.mixpanel_tracking_search_results_query_action...
Any thoughts about the above?
careful-action-61962
09/28/2022, 10:12 AM
flaky-soccer-57765
09/28/2022, 11:50 AM
early-airplane-84388
09/28/2022, 12:19 PM
~~~~ Execution Summary ~~~~
RUN_INGEST - errors: [], exec_id: 'efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc'
2022-09-28 11:44:07.154056 [exec_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc] INFO: Starting execution for task with name=RUN_INGEST
2022-09-28 11:44:07.182384 [exec_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc] INFO: Caught exception EXECUTING task_id=efbfe9b7-fd83-4a37-bff7-d6fd3e2186dc, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task
    self.event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 67, in execute
    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_task_common.py", line 84, in _resolve_recipe
    raise TaskError(f"Failed to resolve secret with name {match}. Aborting recipe execution.")
acryl.executor.execution.task.TaskError: Failed to resolve secret with name Dev_DataHub_PRIVATE_KEY_ID. Aborting recipe execution.
Execution finished with errors.
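The executor is saying the ${Dev_DataHub_PRIVATE_KEY_ID} reference in the recipe has no matching secret in the store it resolves from, so the secret needs to be created with exactly that name. The resolution step behaves roughly like this sketch (my own illustration using environment variables, not the actual acryl executor code):

```python
import os
import re

# ${NAME} references in a recipe body.
SECRET_PATTERN = re.compile(r"\$\{(\w+)\}")

def resolve_secrets(recipe_text: str) -> str:
    """Substitute each ${NAME} with its value; fail loudly if any name is undefined."""
    def _lookup(match: re.Match) -> str:
        name = match.group(1)
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"Failed to resolve secret with name {name}. Aborting recipe execution.")
        return value
    return SECRET_PATTERN.sub(_lookup, recipe_text)
```

Note the lookup is exact-match on the name, so a casing or environment-prefix mismatch between the recipe and the stored secret produces this same error.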
ancient-policeman-73437
09/28/2022, 7:09 PM
/usr/local/bin/run_ingest.sh: line 26:  4036 Killed                  ( python3 -m datahub ingest -c "$4/$1.yml" )
2022-09-28 14:04:23.217735 [exec_id=900d34f5-3632-4ada-a541-aa104a65e6ca] INFO: Failed to execute 'datahub ingest'
2022-09-28 14:04:23.218122 [exec_id=900d34f5-3632-4ada-a541-aa104a65e6ca] INFO: Caught exception EXECUTING
What could be the reason? Thank you in advance!