brave-pencil-21289
06/14/2022, 7:00 AM
mysterious-nail-70388
06/14/2022, 7:26 AM
wonderful-quill-11255
06/14/2022, 7:29 AM
mysterious-nail-70388
06/14/2022, 7:35 AM
bright-cpu-56427
06/14/2022, 7:48 AM
rhythmic-flag-69887
06/14/2022, 8:45 AM
ERROR {datahub.entrypoints:165} - You seem to have connected to the frontend instead of the GMS endpoint. The rest emitter should connect to DataHub GMS (usually <datahub-gms-host>:8080) or Frontend GMS API (usually <frontend>:9002/api/gms)
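[Editor's note] For reference, the REST sink should point at the GMS endpoint named in the error, not the frontend. A minimal sketch (the hostname is a placeholder; quickstart exposes GMS on port 8080):

```yaml
sink:
  type: datahub-rest
  config:
    # GMS endpoint, not the frontend on :9002
    server: "http://datahub-gms-host:8080"
```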
Also, what am I to expect if I get dbt working? Will I see the lineage in DataHub similar to what dbt shows?
sparse-monitor-9160
06/14/2022, 12:32 PM
[2022-06-14 08:27:04,506] INFO {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.38
[2022-06-14 08:27:10,903] ERROR {datahub.entrypoints:167} - Stackprinter failed while formatting <FrameInfo /usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py, line 270, scope SQLAlchemyConfig>:
File "/usr/local/lib/python3.9/site-packages/stackprinter/frame_formatting.py", line 225, in select_scope
raise Exception("Picked an invalid source context: %s" % info)
Exception: Picked an invalid source context: [270], [219], dict_keys([219, 220])
So here is your original traceback at least:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 106, in run
pipeline = Pipeline.create(pipeline_config, dry_run, preview, preview_workunits)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 202, in create
return cls(
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 149, in __init__
source_class = source_registry.get(source_type)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 126, in get
tp = self._ensure_not_lazy(key)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 84, in _ensure_not_lazy
plugin_class = import_path(path)
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path
item = importlib.import_module(module_name)
File "/usr/local/Cellar/python@3.9/3.9.7/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/snowflake.py", line 29, in <module>
from datahub.ingestion.source.sql.sql_common import (
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 236, in <module>
class SQLAlchemyConfig(StatefulIngestionConfigBase):
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 270, in SQLAlchemyConfig
from datahub.ingestion.source.ge_data_profiler import GEProfilingConfig
File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 12, in <module>
from great_expectations import __version__ as ge_version
File "/usr/local/lib/python3.9/site-packages/great_expectations/__init__.py", line 7, in <module>
from great_expectations.data_context import DataContext
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/__init__.py", line 1, in <module>
from great_expectations.data_context.data_context import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/data_context/__init__.py", line 1, in <module>
from great_expectations.data_context.data_context.base_data_context import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 20, in <module>
from great_expectations.core.config_peer import ConfigPeer
File "/usr/local/lib/python3.9/site-packages/great_expectations/core/__init__.py", line 3, in <module>
from .expectation_suite import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/core/expectation_suite.py", line 10, in <module>
from great_expectations.core.evaluation_parameters import (
File "/usr/local/lib/python3.9/site-packages/great_expectations/core/evaluation_parameters.py", line 27, in <module>
from great_expectations.core.util import convert_to_json_serializable
File "/usr/local/lib/python3.9/site-packages/great_expectations/core/util.py", line 22, in <module>
from great_expectations.types import SerializableDictDot
File "/usr/local/lib/python3.9/site-packages/great_expectations/types/__init__.py", line 15, in <module>
import pyspark
File "/usr/local/lib/python3.9/site-packages/pyspark/__init__.py", line 51, in <module>
from pyspark.context import SparkContext
File "/usr/local/lib/python3.9/site-packages/pyspark/context.py", line 31, in <module>
from pyspark import accumulators
File "/usr/local/lib/python3.9/site-packages/pyspark/accumulators.py", line 97, in <module>
from pyspark.serializers import read_int, PickleSerializer
File "/usr/local/lib/python3.9/site-packages/pyspark/serializers.py", line 72, in <module>
from pyspark import cloudpickle
File "/usr/local/lib/python3.9/site-packages/pyspark/cloudpickle.py", line 145, in <module>
_cell_set_template_code = _make_cell_set_template_code()
File "/usr/local/lib/python3.9/site-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: an integer is required (got type bytes)
[2022-06-14 08:27:10,904] INFO {datahub.entrypoints:176} - DataHub CLI version: 0.8.38 at /usr/local/lib/python3.9/site-packages/datahub/__init__.py
[2022-06-14 08:27:10,904] INFO {datahub.entrypoints:179} - Python version: 3.9.7 (default, Sep 3 2021, 12:36:14)
[Clang 11.0.0 (clang-1100.0.33.17)] at /usr/local/opt/python@3.9/bin/python3.9 on macOS-10.14.6-x86_64-i386-64bit
[2022-06-14 08:27:10,904] INFO {datahub.entrypoints:182} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.38', 'commit': '38718b59b358fc6c564ee982752bf2023533b224'}}, 'managedIngestion': {'defaultCliVersion': '0.8.38', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
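[Editor's note] The `TypeError: an integer is required (got type bytes)` in the traceback above is a known incompatibility: pyspark 2.x bundles an old cloudpickle that calls `types.CodeType` with the pre-3.8 argument order, which breaks on Python 3.8+. Upgrading pyspark to 3.x (or upgrading `acryl-datahub`/`great_expectations` so a newer pyspark is pulled in) typically resolves it. A minimal sketch of the version constraint; `pyspark_supports_python` is an illustrative helper, not a DataHub API:

```python
# Simplified compatibility rule behind the TypeError above:
# pyspark 2.x only works on Python < 3.8; pyspark 3.x supports 3.8+.
def pyspark_supports_python(pyspark_major: int, python_version: tuple) -> bool:
    return pyspark_major >= 3 or python_version < (3, 8)

print(pyspark_supports_python(2, (3, 9)))  # False: the combination in this log
print(pyspark_supports_python(3, (3, 9)))  # True
```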
Here is my YML (with sensitive data replaced by my_):
source:
  type: "snowflake"
  config:
    account_id: "my_account.us-east-1"
    warehouse: "sor_wh"
    username: "my_username"
    password: "my_password"
    role: "my_role"
    include_views: false
    include_table_lineage: false
    table_pattern:
      allow:
        - "temp_1"
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
Here is my environment:
$ python -c "import platform; print(platform.platform())"
Darwin-18.7.0-x86_64-i386-64bit
$ python -c "import sys; print(sys.version); print(sys.executable); import datahub; print(datahub.__file__); print(datahub.__version__);"
2.7.16 (default, Jan 27 2020, 04:46:15)
[GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)]
/usr/bin/python
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named datahub
$ python3 -c "import sys; print(sys.version); print(sys.executable); import datahub; print(datahub.__file__); print(datahub.__version__);"
3.9.7 (default, Sep 3 2021, 12:36:14)
[Clang 11.0.0 (clang-1100.0.33.17)]
/usr/local/opt/python@3.9/bin/python3.9
/usr/local/lib/python3.9/site-packages/datahub/__init__.py
0.8.38
wooden-jackal-88380
06/14/2022, 1:09 PM
hundreds-pillow-5032
06/14/2022, 3:17 PM
modern-laptop-12942
06/14/2022, 4:32 PM
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 1307, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 150, in execute
return_value = self.execute_callable()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 161, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/airflow/dags/Test_ingestion_dag.py", line 34, in datahub_recipe
pipeline = Pipeline.create(config)
File "/home/airflow/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 150, in create
return cls(config, dry_run=dry_run, preview_mode=preview_mode)
File "/home/airflow/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 116, in __init__
self.source: Source = source_class.create(
File "/home/airflow/.local/lib/python3.9/site-packages/datahub/ingestion/source/sql/snowflake.py", line 182, in create
config = SnowflakeConfig.parse_obj(config_dict)
File "pydantic/main.py", line 511, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 4 validation errors for SnowflakeConfig
host_port
field required (type=value_error.missing)
account_id
extra fields not permitted (type=value_error.extra)
include_view_lineage
extra fields not permitted (type=value_error.extra)
upstream_lineage_in_report
extra fields not permitted (type=value_error.extra)
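[Editor's note] These validation errors suggest the Airflow worker runs an older `acryl-datahub` whose `SnowflakeConfig` predates `account_id` and the other rejected fields, while the CLI that succeeds is newer. Upgrading `acryl-datahub` inside the Airflow environment to match the CLI version is the cleaner fix; alternatively, on those older versions the account is supplied as `host_port`. A hedged sketch reusing the placeholder values from the recipe above (field names inferred from the errors, not verified against that exact release):

```yaml
source:
  type: snowflake
  config:
    # Older SnowflakeConfig takes the account here instead of account_id
    host_port: "my_account.us-east-1"
    warehouse: "sor_wh"
    username: "my_username"
    password: "my_password"
    role: "my_role"
```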
I use source.type: snowflake, and I can successfully ingest using the CLI with this recipe.
lemon-zoo-63387
06/15/2022, 1:21 AM
source:
  type: oracle
  config:
    host_port: '10.xxx.xx.xx4:1521'
    database: Qxxx
    username: dxxxxv
    password: Dxxxxm
sink:
  type: databub-rest
  config:
    server: 'http://localhost:8080'
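[Editor's note] The sink type in the recipe above is misspelled: `databub-rest` is not a registered sink, which would make the pipeline fail before reaching Oracle. The REST sink is registered as `datahub-rest`:

```yaml
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
```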
dry-doctor-17275
06/15/2022, 1:32 AM
adamant-sugar-28445
06/15/2022, 2:41 AM
datahub --help
Usage: datahub [OPTIONS] COMMAND [ARGS]...
Options:
--debug / --no-debug
--version Show the version and exit.
--help Show this message and exit.
Commands:
check Helper commands for checking various aspects of DataHub.
docker Helper commands for setting up and interacting with a local DataHub instance using Docker.
ingest Ingest metadata into DataHub.
version Print version number and exit.
bitter-toddler-42943
06/15/2022, 5:31 AM
high-family-71209
06/15/2022, 7:27 AM
adamant-sugar-28445
06/15/2022, 9:07 AM
type: sqlalchemy
config:
  platform: hive
  connect_uri: "XXXX"
  include_views: False
  table_pattern:
    allow:
      - "<databaseName>.<table>"
  schema_pattern:
    allow:
      - "<databaseName>"
    deny:
      - "<otherSchemas>"
  options:
    connect_args:
      auth: '<AuthStrategy>'
sink:
  type: "datahub-rest"
  config:
    server: "<serverURI>"
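[Editor's note] The `allow`/`deny` entries in `table_pattern` and `schema_pattern` are regular expressions, not globs. A simplified model of the filtering (assumption: this mirrors the semantics of DataHub's `AllowDenyPattern`; `passes` is an illustrative helper, not a DataHub API):

```python
import re

# A name is kept if it matches at least one allow regex and no deny regex.
def passes(name: str, allow, deny=()) -> bool:
    return any(re.match(p, name) for p in allow) and not any(
        re.match(p, name) for p in deny
    )

print(passes("sales.orders", allow=[r"sales\..*"], deny=[r".*\.tmp_.*"]))      # True
print(passes("sales.tmp_orders", allow=[r"sales\..*"], deny=[r".*\.tmp_.*"]))  # False
```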
astonishing-dusk-99990
06/15/2022, 9:33 AM
brave-pencil-21289
06/15/2022, 11:53 AM
some-kangaroo-13734
06/15/2022, 12:18 PM
source:
  type: bigquery
  config:
    project_id: A
    use_exported_bigquery_audit_metadata: false
    profiling:
      enabled: false
    credential:
      project_id: B
      private_key_id: '${GCP_PRIVATE_KEY_ID}'
      private_key: '${GCP_PRIVATE_KEY}'
      client_id: '${GCP_CLIENT_ID}'
      client_email: '${GCP_CLIENT_EMAIL}'
    domain:
      foo:
        allow:
          - 'A\..*'
sink:
  type: datahub-rest
  config:
    server: 'https://xxx/api/gms'
    token: '${GMS_TOKEN}'
Error:
'Forbidden: 403 POST https://bigquery.googleapis.com/bigquery/v2/projects/A/jobs?prettyPrint=false: Access Denied: Project '
'A: User does not have bigquery.jobs.create permission in project A.\n'
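[Editor's note] The 403 means the service account from project B lacks permission to create query jobs in project A. One likely fix (assuming `gcloud` access to project A; `SA_EMAIL` is a placeholder for the credential's `client_email`) is granting `roles/bigquery.jobUser`, the predefined role containing `bigquery.jobs.create`:

```shell
gcloud projects add-iam-policy-binding A \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/bigquery.jobUser"
```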
cuddly-arm-8412
06/15/2022, 4:29 PM
salmon-area-51650
06/15/2022, 5:09 PM
dbt ingestion. The cronjob which is executing the integration was OK (no errors), but I cannot see the dbt platform in the UI.
This is the configuration of the job:
source:
  type: "dbt"
  config:
    # Coordinates
    manifest_path: "s3://bucket/manifest.json"
    catalog_path: "s3://bucket/catalog.json"
    sources_path: "s3://bucket/sources.json"
    aws_connection:
      aws_region: "eu-west-2"
    # Options
    target_platform: "snowflake"
    load_schemas: True # note: if this is disabled
    env: STG
sink:
  type: "datahub-rest"
  config:
    server: "http://datahub-datahub-gms:8080"
And this is the output of the job:
'failures': {},
'cli_version': '0.0.0+docker.b4bf1d4',
'cli_entry_location': '/usr/local/lib/python3.8/site-packages/datahub/__init__.py',
'py_version': '3.8.13 (default, May 28 2022, 14:23:53) \n[GCC 10.2.1 20210110]',
'py_exec_path': '/usr/local/bin/python',
'os_details': '...-glibc2.2.5',
'soft_deleted_stale_entities': []}
Sink (datahub-rest) report:
{'records_written': 2048,
'warnings': [],
'failures': [],
'downstream_start_time': datetime.datetime(2022, 6, 15, 16, 28, 19, 669215),
'downstream_end_time': datetime.datetime(2022, 6, 15, 16, 29, 23, 420550),
'downstream_total_latency_in_seconds': 63.751335,
'gms_version': 'v0.8.38'}
Pipeline finished with 1061 warnings in source producing 2048 workunits
And there is no `dbt` platform in the UI.
Any idea?
Thanks in advance!
mysterious-lamp-91034
06/15/2022, 6:25 PM
Which db is the datasetProfile aspect physically in?
I am seeing the dataset stats in the UI. I don't see them in the db:
mysql> select * from metadata_aspect_v2 where aspect='datasetProfile'\G
Empty set (0.00 sec)
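[Editor's note] `datasetProfile` is a timeseries aspect, and in standard DataHub deployments timeseries aspects are stored in Elasticsearch rather than in MySQL's `metadata_aspect_v2` table, which is why the query above returns an empty set. A sketch of checking Elasticsearch directly (the index name follows the usual naming scheme but is an assumption, as is quickstart's Elasticsearch on localhost:9200):

```shell
curl 'http://localhost:9200/dataset_datasetprofileaspect_v1/_search?size=1&pretty'
```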
numerous-bird-27004
06/15/2022, 8:00 PM
dry-zoo-35797
06/15/2022, 11:09 PM
lemon-zoo-63387
06/16/2022, 1:20 AM
python3 -m datahub docker nuke
For example, only delete Oracle in the dataset
wonderful-egg-79350
06/16/2022, 5:55 AM
source:
  type: bigquery
  config:
    project_id: "my-project-id"
    options:
      credentials_path: "./gcp-credential.json"
    table_pattern:
      # Allow only one table
      allow:
        - "my_dataset.my_table"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
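[Editor's note] Rather than pointing at a credentials file on disk, secrets can also be injected through environment variables, as in the earlier BigQuery recipe's `${GCP_PRIVATE_KEY_ID}` and `${GMS_TOKEN}` references: DataHub expands such references when loading the recipe. A minimal model of that expansion (the env var value here is hypothetical; the loader's behavior is comparable to `os.path.expandvars`):

```python
import os

# Hypothetical secret value, set only for illustration.
os.environ["GMS_TOKEN"] = "example-token"

# Recipes reference it as ${GMS_TOKEN}; expansion happens before parsing.
line = "token: '${GMS_TOKEN}'"
print(os.path.expandvars(line))  # token: 'example-token'
```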
square-lawyer-36076
06/16/2022, 3:00 PM
steep-midnight-37232
06/16/2022, 3:52 PM
bulky-jackal-3422
06/16/2022, 4:54 PM
billions-morning-53195
06/16/2022, 5:43 PM
WARNING: AWS Glue Schema Registry DOES NOT have a python SDK. As such, python-based libraries like ingestion or datahub-actions (UI ingestion) are not supported when using AWS Glue Schema Registry