#ingestion

sparse-monitor-9160

06/14/2022, 12:32 PM
Hello everyone. I set up DataHub locally and tried to ingest a Snowflake data source through the CLI, but got this error:
[2022-06-14 08:27:04,506] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38
[2022-06-14 08:27:10,903] ERROR    {datahub.entrypoints:167} - Stackprinter failed while formatting <FrameInfo /usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py, line 270, scope SQLAlchemyConfig>:
  File "/usr/local/lib/python3.9/site-packages/stackprinter/frame_formatting.py", line 225, in select_scope
    raise Exception("Picked an invalid source context: %s" % info)
Exception: Picked an invalid source context: [270], [219], dict_keys([219, 220])

So here is your original traceback at least:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 106, in run
    pipeline = Pipeline.create(pipeline_config, dry_run, preview, preview_workunits)
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 202, in create
    return cls(
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 149, in __init__
    source_class = source_registry.get(source_type)
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 126, in get
    tp = self._ensure_not_lazy(key)
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 84, in _ensure_not_lazy
    plugin_class = import_path(path)
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path
    item = importlib.import_module(module_name)
  File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/snowflake.py", line 29, in <module>
    from datahub.ingestion.source.sql.sql_common import (
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 236, in <module>
    class SQLAlchemyConfig(StatefulIngestionConfigBase):
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 270, in SQLAlchemyConfig
    from datahub.ingestion.source.ge_data_profiler import GEProfilingConfig
  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 12, in <module>
    from great_expectations import __version__ as ge_version
  File "/usr/local/lib/python3.9/site-packages/great_expectations/__init__.py", line 7, in <module>
    from great_expectations.data_context import DataContext
  File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/__init__.py", line 1, in <module>
    from great_expectations.data_context.data_context import (
  File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/data_context/__init__.py", line 1, in <module>
    from great_expectations.data_context.data_context.base_data_context import (
  File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 20, in <module>
    from great_expectations.core.config_peer import ConfigPeer
  File "/usr/local/lib/python3.9/site-packages/great_expectations/core/__init__.py", line 3, in <module>
    from .expectation_suite import (
  File "/usr/local/lib/python3.9/site-packages/great_expectations/core/expectation_suite.py", line 10, in <module>
    from great_expectations.core.evaluation_parameters import (
  File "/usr/local/lib/python3.9/site-packages/great_expectations/core/evaluation_parameters.py", line 27, in <module>
    from great_expectations.core.util import convert_to_json_serializable
  File "/usr/local/lib/python3.9/site-packages/great_expectations/core/util.py", line 22, in <module>
    from great_expectations.types import SerializableDictDot
  File "/usr/local/lib/python3.9/site-packages/great_expectations/types/__init__.py", line 15, in <module>
    import pyspark
  File "/usr/local/lib/python3.9/site-packages/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 31, in <module>
    from pyspark import accumulators
  File "/usr/local/lib/python3.9/site-packages/pyspark/accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/usr/local/lib/python3.9/site-packages/pyspark/serializers.py", line 72, in <module>
    from pyspark import cloudpickle
  File "/usr/local/lib/python3.9/site-packages/pyspark/cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/usr/local/lib/python3.9/site-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

[2022-06-14 08:27:10,904] INFO     {datahub.entrypoints:176} - DataHub CLI version: 0.8.38 at /usr/local/lib/python3.9/site-packages/datahub/__init__.py
[2022-06-14 08:27:10,904] INFO     {datahub.entrypoints:179} - Python version: 3.9.7 (default, Sep  3 2021, 12:36:14) 
[Clang 11.0.0 (clang-1100.0.33.17)] at /usr/local/opt/python@3.9/bin/python3.9 on macOS-10.14.6-x86_64-i386-64bit
[2022-06-14 08:27:10,904] INFO     {datahub.entrypoints:182} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.38', 'commit': '38718b59b358fc6c564ee982752bf2023533b224'}}, 'managedIngestion': {'defaultCliVersion': '0.8.38', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
Here is my YML (with sensitive data replaced by my_ placeholders):
source:
  type: "snowflake"
  config:
    account_id: "my_account.us-east-1"
    warehouse: "sor_wh"
    username: "my_username"
    password: "my_password"
    role: "my_role"
    include_views: false
    include_table_lineage: false
    table_pattern:
      allow:
        - "temp_1"

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
Here is my environment:
$ python -c "import platform; print(platform.platform())"
Darwin-18.7.0-x86_64-i386-64bit

$ python -c "import sys; print(sys.version); print(sys.executable); import datahub; print(datahub.__file__); print(datahub.__version__);"
2.7.16 (default, Jan 27 2020, 04:46:15)                                                                                                                                                                                         
[GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)]
/usr/bin/python
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named datahub

$ python3 -c "import sys; print(sys.version); print(sys.executable); import datahub; print(datahub.__file__); print(datahub.__version__);"
3.9.7 (default, Sep  3 2021, 12:36:14)
[Clang 11.0.0 (clang-1100.0.33.17)]
/usr/local/opt/python@3.9/bin/python3.9
/usr/local/lib/python3.9/site-packages/datahub/__init__.py
0.8.38

bulky-soccer-26729

06/14/2022, 3:26 PM
Hey Helen! I noticed you found this thread reporting the same error: https://datahubspace.slack.com/archives/CUMUWQU66/p1644474964313429
Would you mind updating your pyspark version to see if that fixes the issue?

sparse-monitor-9160

06/14/2022, 5:34 PM
Thanks, Chris. Sorry for the delay: I had two pyspark installations locally and needed to upgrade the correct one. It works after upgrading to pyspark 3.2.1 (was 2.4.8).
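For anyone else with more than one pyspark on the machine, here is a minimal sketch for checking which copy a given interpreter would pick up. It locates the package without importing it, since the import itself is what raised the TypeError above. The helper names (parse_major, likely_incompatible, describe_pyspark) are hypothetical. The "2.x on Python 3.8+ is suspect" rule is an assumption drawn from this thread (2.4.8 failed on Python 3.9.7, 3.2.1 worked), not an official compatibility matrix.

```python
import sys
from importlib import metadata, util


def parse_major(version: str) -> int:
    """Return the leading integer of a version string such as '2.4.8'."""
    return int(version.split(".")[0])


def likely_incompatible(pyspark_version: str, python_version: tuple) -> bool:
    """Heuristic only (assumption): flag pyspark 2.x running on Python 3.8+."""
    return parse_major(pyspark_version) < 3 and tuple(python_version[:2]) >= (3, 8)


def describe_pyspark() -> str:
    """Report the pyspark this interpreter would use, without importing it."""
    # find_spec locates a top-level package but does not execute its code.
    spec = util.find_spec("pyspark")
    if spec is None:
        return f"{sys.executable}: pyspark not installed"
    try:
        # Reads the installed distribution's metadata, again without importing.
        version = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return f"{sys.executable}: pyspark at {spec.origin}, version unknown"
    flag = "likely incompatible" if likely_incompatible(version, sys.version_info) else "ok"
    return f"{sys.executable}: pyspark {version} at {spec.origin} ({flag})"


if __name__ == "__main__":
    print(describe_pyspark())
```

Run it with the same interpreter the DataHub CLI uses (python3 in the environment output above, not the system python), so the report reflects the installation that actually needs upgrading.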

bulky-soccer-26729

06/14/2022, 5:35 PM
gotcha gotcha - glad to hear you've got things working!