Good day everyone, I’m trying to ingest data from ...
# ingestion
g
Good day everyone, I’m trying to ingest data from Snowflake, but I got the error below (in the thread). Can someone help me with it? Thanks very much.
dna-datahub-spike ➤ datahub ingest -c ./snowflake-ingestion.yml
[2022-02-10 17:33:13,614] ERROR    {datahub.entrypoints:119} - Stackprinter failed while formatting <FrameInfo /opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/source/sql/sql_common.py, line 221, scope SQLAlchemyConfig>:
  File "/opt/anaconda3/lib/python3.8/site-packages/stackprinter/frame_formatting.py", line 224, in select_scope
    raise Exception("Picked an invalid source context: %s" % info)
Exception: Picked an invalid source context: [221], [192], dict_keys([192, 193])

So here is your original traceback at least:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/cli/ingest_cli.py", line 77, in run
    pipeline = Pipeline.create(pipeline_config, dry_run, preview)
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/run/pipeline.py", line 175, in create
    return cls(config, dry_run=dry_run, preview_mode=preview_mode)
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/run/pipeline.py", line 120, in __init__
    source_class = source_registry.get(source_type)
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/api/registry.py", line 126, in get
    tp = self._ensure_not_lazy(key)
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/api/registry.py", line 84, in _ensure_not_lazy
    plugin_class = import_path(path)
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path
    item = importlib.import_module(module_name)
  File "/opt/anaconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/source/sql/snowflake.py", line 28, in <module>
    from datahub.ingestion.source.sql.sql_common import (
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/source/sql/sql_common.py", line 206, in <module>
    class SQLAlchemyConfig(StatefulIngestionConfigBase):
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/source/sql/sql_common.py", line 221, in SQLAlchemyConfig
    from datahub.ingestion.source.ge_data_profiler import GEProfilingConfig
  File "/opt/anaconda3/lib/python3.8/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 27, in <module>
    from great_expectations.core.util import convert_to_json_serializable
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/__init__.py", line 7, in <module>
    from great_expectations.data_context import DataContext
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/data_context/__init__.py", line 1, in <module>
    from .data_context import BaseDataContext, DataContext, ExplorerDataContext
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 29, in <module>
    import great_expectations.checkpoint.toolkit as checkpoint_toolkit
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/checkpoint/__init__.py", line 1, in <module>
    from ..util import verify_dynamic_loading_support
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/util.py", line 35, in <module>
    from great_expectations.core.expectation_suite import (
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/core/__init__.py", line 3, in <module>
    from .expectation_suite import (
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/core/expectation_suite.py", line 10, in <module>
    from great_expectations.core.evaluation_parameters import (
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/core/evaluation_parameters.py", line 27, in <module>
    from great_expectations.core.util import convert_to_json_serializable
  File "/opt/anaconda3/lib/python3.8/site-packages/great_expectations/core/util.py", line 62, in <module>
    import pyspark
  File "/opt/anaconda3/lib/python3.8/site-packages/pyspark/__init__.py", line 46, in <module>
    from pyspark.context import SparkContext
  File "/opt/anaconda3/lib/python3.8/site-packages/pyspark/context.py", line 31, in <module>
    from pyspark import accumulators
  File "/opt/anaconda3/lib/python3.8/site-packages/pyspark/accumulators.py", line 97, in <module>
    from pyspark.cloudpickle import CloudPickler
  File "/opt/anaconda3/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 146, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/opt/anaconda3/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 127, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)
b
Can you please send the contents of snowflake-ingestion.yml with the sensitive info masked? This will help in understanding the issue in detail.
g
Sure, thanks. The content of snowflake-ingestion.yml is like this:
---
source:
  type: snowflake
  config:
    # Coordinates
    host_port: ${SNOWFLAKE_HOST}
    warehouse: "PLATFORM_WH"

    # Credentials
    username: USER_NAME
    password: ${PASSWORD}
    role: "DATA_CATALOG_READER"

    # TODO: This should use privatekey authentication, data catalog reader needs a key
    # authentication: KEY_PAIR_AUTHENTICATOR
    # private_key_path: ...

    database_pattern:
      allow:
          - "^BILLING\$"
          - "^OPERATIONS_ANALYTICS\$"
          - "^PURCHASING\$"
          - "^PRODUCT_ANALYTICS\$"
    schema_pattern:
      allow:
          - "^RAW\$"
          - "^TRANSFORMED_PROD\$"
          - "^PUBLISHED_PROD\$"

    include_tables: true
    include_views: true
    include_table_lineage: true

    # Disable profiling for local execution as it will eat all the credits
    profiling:
        enabled: false

sink:
    type: "datahub-rest"
    config:
        server: "http://localhost:8080"
I suspect something’s wrong with my environment, but I’m not sure what it is.
s
Can you please provide the output of these three commands separately, so we know what your environment is?
python -c "import platform; print(platform.platform())"
python -c "import sys; print(sys.version); print(sys.executable); import datahub; print(datahub.__file__); print(datahub.__version__);"
python3 -c "import sys; print(sys.version); print(sys.executable); import datahub; print(datahub.__file__); print(datahub.__version__);"
Can you remove
---
from the beginning of the file? I don't think we have that in any of our files.
Can you also try setting
include_views: false
include_table_lineage: false
Trying to rule out possibilities here.
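The three checks requested above can also be gathered in one script. This is only a minimal sketch: the distribution names `acryl-datahub`, `pyspark` and `great-expectations` are assumptions about how the packages were installed and may need adjusting. It reads installed package metadata instead of importing the packages, so it should still report versions when an import itself fails, as it does in the traceback above.

```python
import platform
import sys
from importlib import metadata  # in the standard library since Python 3.8


def collect_environment(packages=("acryl-datahub", "pyspark", "great-expectations")):
    """Return interpreter details and installed package versions as a dict.

    The default package names are assumptions; pass your own tuple if needed.
    """
    info = {
        "platform": platform.platform(),
        "python_version": sys.version,
        "python_executable": sys.executable,
    }
    for name in packages:
        try:
            # Looks up the installed distribution without importing it.
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter that `datahub` uses (the `/opt/anaconda3` one in the traceback), otherwise the reported versions may belong to a different environment.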
g
Thanks Aseem, this error was fixed by upgrading the pyspark version.
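For context, Python 3.8 added a `posonlyargcount` parameter to the `types.CodeType` constructor, and a cloudpickle written for older Pythons can fail with `TypeError: an integer is required (got type bytes)` when it calls that constructor. The sketch below is a rough heuristic, not a definitive diagnosis. It assumes that PySpark releases older than 3.0 bundle such a cloudpickle, and the helper names are made up for illustration. The later report in this thread with pyspark 3.0.0 suggests it does not cover every case.

```python
import sys


def parse_major_minor(version):
    """Turn a version string such as '2.4.5' into the tuple (2, 4)."""
    parts = (version.split(".") + ["0"])[:2]
    return int(parts[0]), int(parts[1])


def likely_affected(python_version, pyspark_version):
    """Heuristic: PySpark older than 3.0 running on Python 3.8 or newer.

    Assumption: PySpark 2.x ships a cloudpickle that predates the
    Python 3.8 change to the types.CodeType constructor.
    """
    on_new_python = tuple(python_version[:2]) >= (3, 8)
    on_old_pyspark = parse_major_minor(pyspark_version) < (3, 0)
    return on_new_python and on_old_pyspark


if __name__ == "__main__":
    from importlib import metadata

    try:
        installed = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        print("pyspark is not installed")
    else:
        print(installed, likely_affected(sys.version_info, installed))
```

If the check returns `True`, upgrading pyspark (or running on an older Python) is the first thing to try. If it returns `False` and the error persists, confirm which pyspark the interpreter actually imports.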
s
@square-activity-64562 I have exactly the same error message. My pyspark version is 3.0.0. Here is my yml:
source:
  type: "snowflake"
  config:
    account_id: "my_account.us-east-1"
    warehouse: "sor_wh"
    username: "my_username"
    password: "my_password"
    role: "my_role"
    include_views: false
    include_table_lineage: false
    table_pattern:
      allow:
        - "temp_1"

sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
Appreciate your help.
s
For reference, this got solved as per https://datahubspace.slack.com/archives/CUMUWQU66/p1655228079916019?thread_ts=1655209962.522589&cid=CUMUWQU66 @sparse-monitor-9160 Please always start a new thread for problems you face instead of posting in older threads. Messages in older threads are harder to track.
s
Figured — thanks Aseem!