eager-monitor-4683
07/03/2023, 4:31 AM

future-yak-13169
07/03/2023, 6:35 AM

abundant-apartment-78179
07/03/2023, 9:22 AM

delightful-school-94725
07/03/2023, 11:26 AM

classification:
  enabled: true
  info_type_to_term:
    Email_Address: Email
  classifiers:
    - type: datahub
      config:
        confidence_level_threshold: 0.7
        info_types_config:
          Street_Address:
            prediction_factors_and_weights:
              name: 1
              description: 0
              datatype: 0
              values: 0
            name:
              regex:
                - Account_Territory
                - account_territory
            datatype:
              type:
                - str
            values:
              prediction_type: library
              regex: []
              library:
                - spacy
          Full_name:
            prediction_factors_and_weights:
              name: 1
              description: 0
              datatype: 0
              values: 0
            name:
              regex:
                - AccountName
                - accountname
            datatype:
              type:
                - str
            values:
              prediction_type: regex
              regex:
                - '^[a-zA-Z ]+.*'
              library: []
stocky-guitar-68560
07/03/2023, 2:25 PM

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter
lineage_mce = builder.make_lineage_mce(
[
builder.make_dataset_urn("kafka", "topic-A"), # Upstream
],
builder.make_dataset_urn("bigquery", "dataset-A"), # Downstream
)
emitter = DatahubRestEmitter("metabase-gms-endpoint")
emitter.emit_mce(lineage_mce)
The above code generates the lineage between Kafka topic-A and BigQuery dataset-A.
But if I run the same script with kafka topic-A and bigquery dataset-B, it actually creates another link between topic-A and dataset-B. Now there are two edges from topic-A, i.e. from topic-A to dataset-A and from topic-A to dataset-B.
I want to override the existing lineage and keep only the latest ingested lineage, i.e. topic-A to dataset-B.
Can someone help me with this?
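[Editor's note] A possible approach, as a sketch rather than a confirmed answer: lineage is stored as an upstreamLineage aspect on each downstream dataset, so emitting lineage for dataset-B never touches the aspect already sitting on dataset-A. Emitting the aspect directly lets you clear the stale edge and set the new one; the GMS endpoint URL below is an assumption.

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # assumed GMS endpoint

# Clear the stale edge: an empty upstream list on dataset-A removes topic-A -> dataset-A.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=builder.make_dataset_urn("bigquery", "dataset-A"),
        aspect=UpstreamLineageClass(upstreams=[]),
    )
)

# Write the new edge: topic-A -> dataset-B.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=builder.make_dataset_urn("bigquery", "dataset-B"),
        aspect=UpstreamLineageClass(
            upstreams=[
                UpstreamClass(
                    dataset=builder.make_dataset_urn("kafka", "topic-A"),
                    type=DatasetLineageTypeClass.TRANSFORMED,
                )
            ]
        ),
    )
)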
ripe-lock-98414
07/04/2023, 4:23 AM

shy-dog-84302
07/04/2023, 5:48 AM

staleness flag into ingestion configuration. I am looking for a safe way to query and soft/hard delete those entries. Can someone help me with the datahub delete command or a GraphQL query that can give me the URNs of such data in DataHub?
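[Editor's note] Not an authoritative answer, but for reference: the delete CLI can filter by platform and supports a dry run, so a cautious sequence might look like the lines below (flags as described in the deleting-metadata docs; verify against your CLI version). Note that a platform filter matches everything on that platform, not just stale entries, so the preview step matters.

datahub delete --platform <platform> --soft --dry-run   # preview the URNs that would be affected
datahub delete --platform <platform> --soft             # soft delete: hidden from the UI, recoverable
datahub delete --urn "<urn>" --hard                     # hard delete a single entity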
bland-orange-13353
07/04/2023, 8:34 AM

worried-butcher-72025
07/04/2023, 10:27 AM

Execution finished with errors.
{'exec_id': '61e7151b-2e5c-4d4d-8336-88becfa736c3',
'infos': ['2023-07-04 10:20:22.711650 INFO: Starting execution for task with name=RUN_INGEST',
"2023-07-04 10:21:01.296376 INFO: Failed to execute 'datahub ingest'",
'2023-07-04 10:21:01.296911 INFO: Caught exception EXECUTING task_id=61e7151b-2e5c-4d4d-8336-88becfa736c3, name=RUN_INGEST, '
'stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
'errors': []}
This is the final part of the error:
packages/pydantic/_internal/_generate_schema.py", line 578, in _arbitrary_type_schema
raise PydanticSchemaGenerationError(
pydantic.errors.PydanticSchemaGenerationError: Unable to generate pydantic-core schema for datahub.utilities.lossy_collections.LossyList[str]. Set `arbitrary_types_allowed=True` in the model_config to ignore this error or implement `__get_pydantic_core_schema__` on your type to fully support it.
If you got this error by calling handler(<some type>) within `__get_pydantic_core_schema__` then you likely need to call `handler.generate_schema(<some type>)` since we do not call `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.
For further information visit <https://errors.pydantic.dev/2.0/u/schema-for-unknown-type>
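[Editor's note] This looks like the well-known pydantic 2.x incompatibility: pydantic 2.0 was released at the end of June 2023, and DataHub CLI versions from that period required pydantic 1.x, so a freshly built ingestion venv that resolves to 2.0 fails with exactly this LossyList[str] schema error. If that is the cause (an assumption, not verified against this environment), pinning the dependency should work:

pip install 'pydantic<2.0'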
astonishing-dusk-99990
07/05/2023, 10:36 AM

limited-forest-73733
07/05/2023, 2:52 PM

bitter-waitress-17567
07/05/2023, 5:45 PM

if not username.startswith("urn:li:corpuser:")
AttributeError: 'list' object has no attribute 'startswith'
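[Editor's note] The AttributeError means a list reached code that expects a single username string. The surrounding function is not visible in this thread, so the following is only an illustrative guard, not the actual fix:

def to_corpuser_urn(username):
    # Hypothetical helper: the real call site is not shown above.
    if isinstance(username, list):
        # Unwrap a single-element list; the real fix is to find where the
        # list is produced rather than papering over it here.
        username = username[0]
    if not username.startswith("urn:li:corpuser:"):
        username = f"urn:li:corpuser:{username}"
    return username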
bitter-waitress-17567
07/05/2023, 5:46 PM

rich-crowd-33361
07/06/2023, 12:02 AM

quiet-scientist-40341
07/06/2023, 3:04 AM

quiet-scientist-40341
07/06/2023, 3:05 AM

worried-rocket-84695
07/06/2023, 5:23 AM

many-rocket-80549
07/06/2023, 9:52 AM

pip install 'acryl-datahub[hana]'
and
pip install pyhdb
which are mentioned in the documentation. However, we are still seeing the following error:
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': '38bab7e3-419e-45f2-a56e-7563b182c83d',
'infos': ['2023-07-06 09:49:46.028237 INFO: Starting execution for task with name=RUN_INGEST',
"2023-07-06 09:49:50.106268 INFO: Failed to execute 'datahub ingest'",
'2023-07-06 09:49:50.106989 INFO: Caught exception EXECUTING task_id=38bab7e3-419e-45f2-a56e-7563b182c83d, name=RUN_INGEST, '
'stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
'errors': []}
~~~~ Ingestion Report ~~~~
{
"cli": {
"cli_version": "0.10.0.7",
"cli_entry_location": "/usr/local/lib/python3.10/site-packages/datahub/__init__.py",
"py_version": "3.10.10 (main, Mar 14 2023, 02:37:11) [GCC 10.2.1 20210110]",
"py_exec_path": "/usr/local/bin/python",
"os_details": "Linux-5.15.0-76-generic-x86_64-with-glibc2.31",
"peak_memory_usage": "75.97 MB",
"mem_info": "75.97 MB"
},
"source": {
"type": "hana",
"report": {
"events_produced": 0,
"events_produced_per_sec": 0,
"entities": {},
"aspects": {},
"warnings": {},
"failures": {},
"soft_deleted_stale_entities": [],
"tables_scanned": 0,
"views_scanned": 0,
"entities_profiled": 0,
"filtered": [],
"start_time": "2023-07-06 09:49:47.606186 (now)",
"running_time": "0.19 seconds"
}
},
"sink": {
"type": "datahub-rest",
"report": {
"total_records_written": 0,
"records_written_per_second": 0,
"warnings": [],
"failures": [],
"start_time": "2023-07-06 09:49:47.395413 (now)",
"current_time": "2023-07-06 09:49:47.793147 (now)",
"total_duration_in_seconds": 0.4,
"gms_version": "v0.10.3",
"pending_requests": 0
}
}
}
~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv setup time = 0
This version of datahub supports report-to functionality
datahub ingest run -c /tmp/datahub/ingest/38bab7e3-419e-45f2-a56e-7563b182c83d/recipe.yml --report-to /tmp/datahub/ingest/38bab7e3-419e-45f2-a56e-7563b182c83d/ingestion_report.json
[2023-07-06 09:49:47,325] INFO {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.0.7
[2023-07-06 09:49:47,400] INFO {datahub.ingestion.run.pipeline:184} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://datahub-gms:8080>
[2023-07-06 09:49:47,627] INFO {datahub.ingestion.run.pipeline:201} - Source configured successfully.
[2023-07-06 09:49:47,628] INFO {datahub.cli.ingest_cli:129} - Starting metadata ingestion
[2023-07-06 09:49:47,800] INFO {datahub.ingestion.reporting.file_reporter:52} - Wrote UNKNOWN report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/38bab7e3-419e-45f2-a56e-7563b182c83d/ingestion_report.json' mode='w' encoding='UTF-8'>
[2023-07-06 09:49:47,801] INFO {datahub.cli.ingest_cli:134} - Source (hana) report:
{'events_produced': 0,
'events_produced_per_sec': 0,
'entities': {},
'aspects': {},
'warnings': {},
'failures': {},
'soft_deleted_stale_entities': [],
'tables_scanned': 0,
'views_scanned': 0,
'entities_profiled': 0,
'filtered': [],
'start_time': '2023-07-06 09:49:47.606186 (now)',
'running_time': '0.19 seconds'}
[2023-07-06 09:49:47,801] INFO {datahub.cli.ingest_cli:137} - Sink (datahub-rest) report:
{'total_records_written': 0,
'records_written_per_second': 0,
'warnings': [],
'failures': [],
'start_time': '2023-07-06 09:49:47.395413 (now)',
'current_time': '2023-07-06 09:49:47.801202 (now)',
'total_duration_in_seconds': 0.41,
'gms_version': 'v0.10.3',
'pending_requests': 0}
[2023-07-06 09:49:48,004] ERROR {datahub.entrypoints:188} - Command failed: Can't load plugin: sqlalchemy.dialects:hana.hdbcli
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/datahub/entrypoints.py", line 175, in main
sys.exit(datahub(standalone_mode=False, **kwargs))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
raise e
File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
res = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
return func(ctx, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 198, in run
loop.run_until_complete(run_func_check_upgrade(pipeline))
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 158, in run_func_check_upgrade
ret = await the_one_future
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 149, in run_pipeline_async
return await loop.run_in_executor(
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 140, in run_pipeline_to_completion
raise e
File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 132, in run_pipeline_to_completion
pipeline.run()
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 339, in run
for wu in itertools.islice(
File "/usr/local/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 85, in auto_stale_entity_removal
for wu in stream:
File "/usr/local/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 36, in auto_status_aspect
for wu in stream:
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 505, in get_workunits_internal
for inspector in self.get_inspectors():
File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 379, in get_inspectors
engine = create_engine(url, **self.config.options)
File "<string>", line 2, in create_engine
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/deprecations.py", line 309, in warned
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 522, in create_engine
entrypoint = u._get_entrypoint()
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/url.py", line 655, in _get_entrypoint
cls = registry.load(name)
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 343, in load
raise exc.NoSuchModuleError(
sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:hana.hdbcli
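[Editor's note] A possible explanation, offered as an assumption rather than a confirmed diagnosis: SQLAlchemy resolves a hana+hdbcli:// URL through the sqlalchemy-hana dialect, which wraps SAP's hdbcli driver; pyhdb is a different, deprecated driver and does not provide the hana.hdbcli entrypoint the traceback is looking for. If so, the dialect and driver need to be importable in the environment the executor actually builds for the run:

pip install 'acryl-datahub[hana]' sqlalchemy-hana hdbcli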
witty-butcher-82399
07/06/2023, 10:32 AM

fancy-monitor-63529
07/06/2023, 11:42 AM

'failures': {'lineage-exported-gcp-audit-logs': ['Error: 400 Could not cast literal "20230703" to type TIMESTAMP at [13:18]\n\nLocation: US\nJob ID: (removed by me)\n']},
I will attach the old and new recipes below. I have gone over the wiki many times and cannot tell what I am missing. Perhaps my service account needs new permissions now.

quaint-appointment-83049
07/06/2023, 12:20 PM"pipeline_name": f"bigquery_metadata_ingestion_{ingestion.project_id}",
"source": {
"type": "bigquery",
"config": {
"env": worker_event.environment,
"project_id": f"{ingestion.project_id}",
"project_on_behalf": config.PROJECT_ID,
"profiling": {"enabled": False},
"column_limit": 900,
"use_exported_bigquery_audit_metadata": False,
"match_fully_qualified_names": True,
"dataset_pattern": {
# Specify datasets to be excluded
"deny": ingestion.exclusion_dataset_patterns,
},
"table_pattern": {
# Specify tables to be excluded
"deny": ingestion.exclusion_table_patterns,
},
"view_pattern": {
# Specify views to be excluded
"deny": ingestion.exclusion_view_patterns,
},
"stateful_ingestion": {"enabled": True},
# credential add BigQuery Credential for pipline source
# <https://datahubproject.io/docs/generated/ingestion/sources/bigquery#cli-based-ingestion-2>
"credential": self.credential,
},
},
"sink": {
"type": "datahub-rest",
"config": {
"server": config.DATAHUB_SERVER,
"token": config.DATAHUB_TOKEN,
"retry_max_times": 4,
"max_threads": 3,
},
},
brainy-butcher-66683
07/06/2023, 2:06 PM

source:
  type: mysql
  config:
    host_port: '********'
    database: null
    username: ****
    include_tables: true
    include_views: false
    profiling:
      enabled: true
      profile_table_level_only: false
    stateful_ingestion:
      enabled: true
    password: '${courier_chat_na}'
    schema_pattern:
      allow:
        - courier_chat
sink:
  type: datahub-rest
  config:
    server: '<datahub url>/api/gms'
    token: '${GMS_key}'
WARNING: These logs appear to be stale. No new logs have been received since 2023-07-05 23:25:45.280969 (297 seconds ago). However, the ingestion process still appears to be running and may complete normally.
acceptable-computer-51491
07/06/2023, 2:56 PM

[2023-07-06 09:16:10,979] DEBUG {datahub.emitter.rest_emitter:247} - Attempting to emit to DataHub GMS; using curl equivalent to:\n',
'2023-07-06 09:16:11.149010 [exec_id=280a9dbb-5208-4212-95ee-d28a9e4d4afc] INFO: Caught exception EXECUTING '
'task_id=280a9dbb-5208-4212-95ee-d28a9e4d4afc, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 525, in readline\n'
' line = await self.readuntil(sep)\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 603, in readuntil\n'
' raise exceptions.LimitOverrunError(\n'
'asyncio.exceptions.LimitOverrunError: Separator is not found, and chunk exceed the limit\n'
'\n'
'During handling of the above exception, another exception occurred:\n'
'\n'
'Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 147, in execute\n'
' await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 99, in _read_output_lines\n'
' line_bytes = await ingest_process.stdout.readline()\n'
' File "/usr/local/lib/python3.10/asyncio/streams.py", line 534, in readline\n'
' raise ValueError(e.args[0])\n'
'ValueError: Separator is not found, and chunk exceed the limit\n']}
Datahub => v0.8.45 deployed on AWS EKS
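[Editor's note] For what it's worth, the LimitOverrunError above is asyncio's line reader giving up: StreamReader.readline() fails once a single line exceeds its buffer limit (64 KiB by default), and the DEBUG "curl equivalent" dump in the first log line is exactly the kind of very long line that can overflow it. A standalone illustration of the mechanism, not the executor's actual code:

import asyncio

async def main():
    # readline() raises LimitOverrunError (re-raised as ValueError) when one
    # line exceeds the reader's limit; a larger limit sidesteps the problem.
    proc = await asyncio.create_subprocess_exec(
        "datahub", "ingest", "-c", "recipe.yml",  # placeholder command
        stdout=asyncio.subprocess.PIPE,
        limit=16 * 1024 * 1024,  # 16 MiB per line vs. the 64 KiB default
    )
    while line := await proc.stdout.readline():
        print(line.decode(errors="replace"), end="")
    await proc.wait()

asyncio.run(main())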
delightful-school-94725
07/06/2023, 5:39 PM

bitter-waitress-17567
07/06/2023, 6:46 PM

rich-restaurant-61261
07/06/2023, 10:42 PM

delightful-school-94725
07/07/2023, 12:33 PM

numerous-address-22061
07/07/2023, 5:15 PM