# troubleshoot
m
Good afternoon everyone, quick question: does anyone know whether connecting to a Hive instance with hive.server2.transport.mode set to http is supported? The default value of that property is binary, but for external reasons I had to change it to http, and now the Hive-DataHub connection is failing.
h
You can also set hive.server2.transport.mode to all. Does that work for you? https://datahubspace.slack.com/archives/CUMUWQU66/p1657887439160919?thread_ts=1657596305.743549&cid=CUMUWQU66
m
I think that might work in Hive 4, but I have version 3, where setting anything other than http or binary makes it fall back to the default value, which is binary.
h
Oh. Are you using LDAP?
m
I am using Kerberos as authentication method
h
In that case, can you try setting this in the source config and check whether it works?
scheme: 'hive+http'
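For reference, here is a minimal sketch of where that scheme setting sits in a Hive recipe, written as a Python dictionary that mirrors the YAML layout. The helper name build_hive_recipe, the host, the port and the sink URL are placeholders invented for illustration. The connect_args values assume Kerberos auth with a service named hive and should be checked against the actual cluster.

```python
import json


def build_hive_recipe(host_port, scheme="hive+http"):
    """Return a recipe dict for the DataHub Hive source (hypothetical helper)."""
    return {
        "source": {
            "type": "hive",
            "config": {
                "host_port": host_port,
                # Selects the Thrift-over-HTTP transport instead of binary.
                "scheme": scheme,
                "options": {
                    "connect_args": {
                        # Assumed values for a Kerberos-secured HiveServer2.
                        "auth": "KERBEROS",
                        "kerberos_service_name": "hive",
                    }
                },
            },
        },
        # Assumed sink; replace with the GMS endpoint actually in use.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


if __name__ == "__main__":
    # Placeholder host and port, not taken from this thread.
    recipe = build_hive_recipe("hive-server.example.com:10001")
    print(json.dumps(recipe, indent=2))
```

The only part that comes from this thread is the scheme key under the source config; everything else is scaffolding to show the nesting.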
m
I obtain the following error if that is set
'/tmp/datahub/ingest/venv-hive-0.9.\n'
           '                       0.4/lib/python3.10/site-packages/datahub/cli/ingest_cli.py:155> exception=ModuleNotFoundError("No module named '
           "'kerberos\n"
           '                       \'")>\n'
           "     run_pipeline_async = <function 'run.<locals>.run_pipeline_async' ingest_cli.py:155>\n"
h
Hmm, was Kerberos auth working fine before? Are you using UI-managed ingestion?
m
Yeah, Kerberos works fine with Hive in binary mode.
And yes, I am using UI ingestion, although I have also tested it with CLI ingestion and still get the same error.
h
Can you try installing kerberos manually on the CLI using "pip install kerberos" and retry?
Also, if the above error comes with a stack trace, can you please share it?
m
Even after installing it, I get the same error (I think because of the property scheme: hive+http). If that property is not set, the realm it tries to obtain the ticket from is now wrong.
h
Thanks for sharing the stack trace. The error seems to be coming from pyhive. From what you mentioned, when using CLI ingestion with the scheme hive+http, you continue to get the module-not-found error. Can you run pip install 'acryl-pyhive[kerberos]' on the CLI and confirm that it installs successfully? (We don't need the pip install kerberos that I mentioned earlier, so that can be uninstalled.)
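When a ModuleNotFoundError persists after a pip install, it can help to check what the interpreter that actually runs the ingestion is able to import. The following is only a sketch: the helper name find_missing is made up, and the list of module names is an assumption based on the packages and errors mentioned in this thread.

```python
import importlib.util
import sys

# Module names are assumptions drawn from the errors and packages in this thread.
CANDIDATE_MODULES = ["pyhive", "kerberos", "requests_kerberos", "thrift"]


def find_missing(modules):
    """Return the top-level modules the current interpreter cannot locate."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Shows which Python is running, useful when several environments exist.
    print("interpreter:", sys.executable)
    print("missing:", find_missing(CANDIDATE_MODULES))
```

Running it with the same Python that launches the ingestion shows whether the install landed in that environment or in a different one.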
m
The kerberos extra of acryl-pyhive installs successfully, but the error message keeps popping up
datahub@datahub-actions:/$ pip install 'acryl-pyhive[kerberos]'
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: acryl-pyhive[kerberos] in /usr/local/lib/python3.10/site-packages (0.6.13)
Requirement already satisfied: future in /usr/local/lib/python3.10/site-packages (from acryl-pyhive[kerberos]) (0.18.2)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/site-packages (from acryl-pyhive[kerberos]) (2.8.2)
Collecting requests-kerberos>=0.12.0
  Downloading requests_kerberos-0.14.0-py2.py3-none-any.whl (11 kB)
Collecting pyspnego[kerberos]
  Downloading pyspnego-0.6.3-py3-none-any.whl (124 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.9/124.9 kB 3.8 MB/s eta 0:00:00
Requirement already satisfied: requests>=1.1.0 in /usr/local/lib/python3.10/site-packages (from requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (2.28.0)
Requirement already satisfied: cryptography>=1.3 in /usr/local/lib/python3.10/site-packages (from requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (36.0.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/site-packages (from python-dateutil->acryl-pyhive[kerberos]) (1.16.0)
Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.10/site-packages (from cryptography>=1.3->requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (1.15.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/site-packages (from requests>=1.1.0->requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/site-packages (from requests>=1.1.0->requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (2022.6.15)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/site-packages (from requests>=1.1.0->requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/site-packages (from requests>=1.1.0->requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (3.3)
Collecting krb5>=0.3.0
  Downloading krb5-0.4.1.tar.gz (218 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 218.7/218.7 kB 13.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting gssapi>=1.6.0
  Downloading gssapi-1.8.2.tar.gz (94 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94.3/94.3 kB 4.8 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: pycparser in /usr/local/lib/python3.10/site-packages (from cffi>=1.12->cryptography>=1.3->requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (2.21)
Requirement already satisfied: decorator in /usr/local/lib/python3.10/site-packages (from gssapi>=1.6.0->pyspnego[kerberos]->requests-kerberos>=0.12.0->acryl-pyhive[kerberos]) (5.1.1)
Building wheels for collected packages: gssapi, krb5
  Building wheel for gssapi (pyproject.toml) ... done
  Created wheel for gssapi: filename=gssapi-1.8.2-cp310-cp310-linux_x86_64.whl size=3335479 sha256=ec5cabb4d5f868a811524fa95e7a0ea238382601ad45c3bc223ebbc12adf16ee
  Stored in directory: /home/datahub/.cache/pip/wheels/59/a8/83/5017e55a50e766ad6874c236b60fdace4f8552a00a1ebc9474
  Building wheel for krb5 (pyproject.toml) ... done
  Created wheel for krb5: filename=krb5-0.4.1-cp310-cp310-linux_x86_64.whl size=4405756 sha256=a3e3ac43c21d4cb7a4f27b896952b60df8363ace82db2e48aac8622f9c93a560
  Stored in directory: /home/datahub/.cache/pip/wheels/04/07/80/b1e1c44fecd717bd7ef457b78dc92bf15eedc095ca0236e917
Successfully built gssapi krb5
Installing collected packages: krb5, gssapi, pyspnego, requests-kerberos
  WARNING: The script pyspnego-parse is installed in '/home/datahub/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed gssapi-1.8.2 krb5-0.4.1 pyspnego-0.6.3 requests-kerberos-0.14.0
h
Hey @microscopic-mechanic-13766, just confirming: are you installing in datahub-actions and then running with UI ingestion (this won't work), or with CLI ingestion? The earlier logs seemed to be from UI ingestion. It would be great if you could share the error logs from CLI ingestion.
m
I am doing everything in the CLI. I just managed to make the error go away: I had to install both acryl-pyhive[kerberos] and kerberos. After that I was getting another error:
---- (full traceback above) ----
File "/usr/local/lib/python3.10/site-packages/datahub/entrypoints.py", line 149, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 343, in wrapper
    raise e
File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 295, in wrapper
    res = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 102, in wrapper
    return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 205, in run
    loop.run_until_complete(run_func_check_upgrade(pipeline))
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 161, in run_func_check_upgrade
    ret = await the_one_future
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 152, in run_pipeline_async
    return await loop.run_in_executor(
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 143, in run_pipeline_to_completion
    raise e
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 129, in run_pipeline_to_completion
    pipeline.run()
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 334, in run
    for wu in itertools.islice(
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 728, in get_workunits
    for inspector in self.get_inspectors():
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 532, in get_inspectors
    with engine.connect() as conn:
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2263, in connect
    return self._connection_cls(self, **kwargs)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 104, in __init__
    else engine.raw_connection()
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2369, in raw_connection
    return self._wrap_pool_connect(
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 304, in unique_connection
    return _ConnectionFairy._checkout(self)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 139, in _do_get
    with util.safe_reraise():
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 137, in _do_get
    return self._create_connection()
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 660, in __connect
    with util.safe_reraise():
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
File "/usr/local/lib/python3.10/site-packages/pyhive/hive.py", line 126, in connect
    return Connection(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/pyhive/hive.py", line 273, in __init__
    response = self._client.OpenSession(open_session_req)
File "/usr/local/lib/python3.10/site-packages/TCLIService/TCLIService.py", line 186, in OpenSession
    self.send_OpenSession(req)
File "/usr/local/lib/python3.10/site-packages/TCLIService/TCLIService.py", line 195, in send_OpenSession
    self._oprot.trans.flush()
File "/usr/local/lib/python3.10/site-packages/pyhive/hive.py", line 81, in flush
    super(TCookieHttpClient, self).flush()
File "/usr/local/lib/python3.10/site-packages/thrift/transport/THttpClient.py", line 191, in flush
    self.__http.putheader('Cookie', self.headers['Set-Cookie'])
File "/usr/local/lib/python3.10/http/client.py", line 1244, in putheader
    raise CannotSendHeader()
This was solved by downgrading thrift to 0.13.0, as version 0.16.0 has a bug in some of the methods in the stack trace. With these two changes, I was able to ingest from Hive over HTTP successfully!
The problem is that when I try to ingest from the UI, the kerberos module error pops up again. Could it be because UI ingestion installs the libraries it needs and overwrites the existing ones?
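To see which thrift version a given environment has picked up, a small check along these lines may help. It is a sketch only: the function names are invented, and the two version labels simply restate what was reported in this thread (0.13.0 working, 0.16.0 failing) rather than any official compatibility list.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def classify_thrift(version):
    """Label a thrift version using only the observations from this thread."""
    if version is None:
        return "not installed"
    if version == "0.13.0":
        return "reported working"
    if version == "0.16.0":
        return "reported failing"
    return "untested in this thread"


if __name__ == "__main__":
    version = installed_version("thrift")
    print("thrift:", version, "->", classify_thrift(version))
```

Since UI ingestion runs in its own virtual environment, the result can differ between the CLI environment and the one the UI creates.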
h
Yes. UI ingestion creates a separate virtual environment and installs the libraries it needs there. So what are the exact steps/pip installs that UI ingestion needs on top?