# troubleshoot
c
Hi all, I am facing a problem syncing metadata from a Hive metastore. I deployed a Spark Thrift Server pointing to a standalone Hive metastore service. In the ingestion recipe I used the hive source type, with host_port pointing to the Spark Thrift Server. The ingestion succeeds, but it only creates the database entity; datasets/tables are not created. In the logs I noticed something strange: it seems to be using the database name as the table name. I can see the error "Table or view not found: test_db3.test_db3;" for the SQL call "DESCRIBE FORMATTED `test_db3`.`test_db3`". Can anyone help? Am I missing anything in the setup? Is Spark Thrift Server not supported? What is the best way to sync metadata from a standalone Hive metastore service?
```yaml
source:
    type: hive
    config:
        host_port: 'spark-thrift-server-2.default.svc.cluster.local:10000'
        database: test_db3
        username: null
        password: null
        env: DEV
        include_tables: true
sink:
    type: datahub-rest
    config:
        server: 'http://datahub-datahub-gms.datahub2.svc.cluster.local:8080'
```
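For debugging, a quick way to see what the Thrift server itself returns is to query it with PyHive, the library the hive source uses under the hood. A minimal sketch, assuming the same host_port and database as in the recipe above:

```python
# Connectivity check against the Spark Thrift Server.
# Assumes: pip install 'pyhive[hive]' (pulls in thrift/sasl dependencies).
# Host, port, and database are taken from the recipe above.
from pyhive import hive

conn = hive.connect(
    host="spark-thrift-server-2.default.svc.cluster.local",
    port=10000,
)
cursor = conn.cursor()

# What the ingestion source should be enumerating:
cursor.execute("SHOW TABLES IN test_db3")
print(cursor.fetchall())

# The call that fails in the ingestion logs; if table names come back
# correctly above, this should work against a real table name:
# cursor.execute("DESCRIBE FORMATTED test_db3.<table_name>")
```

If `SHOW TABLES` returns sensible names here but the ingestion still describes `test_db3.test_db3`, the problem is in how the source enumerates tables against this server, not in the server itself.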
g
Hey @curved-carpenter-44858, I am not sure spark-thrift-server is supported for ingesting hive metadata. What if you point to the hive metastore directly?
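Pointing at the metastore directly would presumably mean swapping host_port for the metastore's Thrift endpoint, along these lines (the hostname is a placeholder; 9083 is the conventional metastore Thrift port):

```yaml
source:
    type: hive
    config:
        # Hypothetical standalone-metastore endpoint, not the Thrift server.
        host_port: 'hive-metastore.default.svc.cluster.local:9083'
        database: test_db3
```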
c
I tried that earlier; it fails on the first call itself. Below are the details of the error I got when I ran it from the DataHub frontend (partial logs pasted):
```
......
    version, status, reason = self._read_status()\n'
           'File "/usr/local/lib/python3.9/http/client.py", line 289, in _read_status\n'
           '    raise RemoteDisconnected("Remote end closed connection without"\n'
           '\n'
           'RemoteDisconnected: Remote end closed connection without response\n',
           "2022-02-25 06:21:18.926125 [exec_id=a071f153-5777-419f-9511-37214e1429b6] INFO: Failed to execute 'datahub ingest'",
           '2022-02-25 06:21:18.926532 [exec_id=a071f153-5777-419f-9511-37214e1429b6] INFO: Caught exception EXECUTING '
           'task_id=a071f153-5777-419f-9511-37214e1429b6, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
           '    self.event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete\n'
           '    return f.result()\n'
           '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
.......
```
In the metastore logs I found this. Am I missing anything? What could be the reason?
```
2022-02-25T06:19:46,599 ERROR [pool-6-thread-200] server.TThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client?
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:228) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:76) ~[hive-standalone-metastore-3.1.2.jar:3.1.2]
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) [libthrift-0.9.3.jar:0.9.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
```
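The "Missing version in readMessageBegin, old client?" error looks like a Thrift protocol mismatch: the standalone metastore on port 9083 speaks the Hive Metastore Thrift API, while the hive source speaks the HiveServer2 protocol and sends SQL, so the handshake bytes make no sense to the metastore. To confirm the metastore itself is healthy, one could query it with a client that does speak its protocol. A minimal sketch, assuming the third-party hmsclient package and the placeholder hostname from above:

```python
# Sanity check against the standalone metastore's Thrift API.
# Assumes: pip install hmsclient; the hostname is a placeholder for your
# metastore service.
from hmsclient import hmsclient

client = hmsclient.HMSClient(
    host="hive-metastore.default.svc.cluster.local", port=9083
)
with client as c:
    print(c.get_all_databases())         # e.g. ['default', 'test_db3']
    print(c.get_all_tables("test_db3"))  # tables the metastore knows about
```

If this works while the ingestion fails, the metastore is fine and the failure is purely the protocol mismatch described above.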
b
I had the same issue. Set the following option in the recipe (see https://datahubproject.io/docs/generated/ingestion/sources/hive#quickstart-recipe):
```yaml
scheme: 'sparksql' # set this for Spark Thrift Server
```
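Applied to the original recipe above, the full config would presumably look like this (untested sketch; same endpoints and database as before):

```yaml
source:
    type: hive
    config:
        host_port: 'spark-thrift-server-2.default.svc.cluster.local:10000'
        database: test_db3
        env: DEV
        include_tables: true
        # Per the docs linked above, set this when connecting to a
        # Spark Thrift Server instead of HiveServer2.
        scheme: 'sparksql'
sink:
    type: datahub-rest
    config:
        server: 'http://datahub-datahub-gms.datahub2.svc.cluster.local:8080'
```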