# ingestion
a
Hello, I am using DataHub to ingest table metadata from Databricks, with Hive selected as the source. There is a table with about 200 million rows that has been stuck in the analysis step of the ingestion process, and the log shows a warning that no new logs have been produced for a long time (WARNING: These logs appear to be stale. No new logs have been received since 2023-05-26 10:22:25.389811 (53443 seconds ago). However, the ingestion process still appears to be running and may complete normally.). My guess is that the analysis SQL takes longer to execute than the Databricks connection stays open; the analysis SQL takes about two minutes to run. So I would like to know how to configure the timeout for the Databricks connection. Can you help me?
g
Hi @dazzling-judge-80093, do we need to consider this as an enhancement to the Hive source? I haven't found any configuration option for it in the recipe.
d
Which phase is it stuck in? Can you connect to the Hive Metastore DB? If yes, then I think Presto on Hive is a much better option at this scale.
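For reference, a minimal sketch of what a Presto on Hive recipe might look like, assuming a MySQL-backed Hive Metastore that the ingestion host can reach directly; the host, credentials, and sink endpoint below are placeholders, not values from this thread:

```yaml
# Hypothetical recipe: read metadata straight from the Hive Metastore database
# with the presto-on-hive source, rather than issuing queries through Hive.
source:
  type: presto-on-hive
  config:
    host_port: metastore-db.example.com:3306   # placeholder metastore DB address
    scheme: mysql+pymysql                      # assumes a MySQL-backed metastore
    username: metastore_reader                 # placeholder credentials
    password: "${METASTORE_PASSWORD}"
    database: metastore

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080              # placeholder DataHub GMS endpoint
```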
a
It is stuck in the profiling stage. I can connect to the Hive Metastore, and the technical metadata has been ingested successfully.
a
Profiling can be expensive, as it runs SQL queries against each table. What you can do is set a profiling filter so that only key datasets are profiled.
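A sketch of how such a filter could look in a Hive recipe, assuming the standard sql-common `profiling` and `profile_pattern` options apply to this source; the host and table regexes are placeholders:

```yaml
# Hypothetical excerpt: enable profiling but restrict it to a few key tables.
source:
  type: hive
  config:
    host_port: databricks-host.example.com:443   # placeholder connection details
    profiling:
      enabled: true
    profile_pattern:
      allow:
        - "warehouse\\.sales\\.orders"           # placeholder table regexes
        - "warehouse\\.sales\\.customers"
```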
a
I only configured one table in the ingestion configuration file.
I set query_combiner_enabled to false and the issue was resolved.
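For anyone hitting the same hang, a sketch of where that flag lives in the recipe, assuming the standard sql-common profiling block; the connection details are placeholders:

```yaml
# Hypothetical excerpt: disable the query combiner so profiling queries run as
# separate, shorter statements instead of one large combined query.
source:
  type: hive
  config:
    host_port: databricks-host.example.com:443   # placeholder connection details
    profiling:
      enabled: true
      query_combiner_enabled: false              # the setting that resolved the hang
```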
a
ahh, cool; thanks for the update
cc @gray-shoe-75895