# ingestion
  • c

    clean-coat-28016

    03/31/2022, 12:41 AM
Hi All, I am getting an error when using the Postgres ingestion recipe. If I remove "platform_instance" from the recipe, the error message goes away. I am using DataHub version 0.8.31. IIRC, this was working in 0.8.29. The error message and recipe are attached. Any pointers about what is wrong?
rds.err.log, rds.recipe.yaml
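For reference, a minimal Postgres recipe that exercises platform_instance looks roughly like this (host, database, credentials, and the instance name are placeholders, not the poster's actual values):
source:
  type: postgres
  config:
    host_port: my-rds-host.example.com:5432
    database: mydb
    username: datahub
    password: '${POSTGRES_PASSWORD}'
    platform_instance: rds_primary    # the field that triggers the error on 0.8.31
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'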
    m
    b
    • 3
    • 3
  • i

    incalculable-forest-10734

    03/31/2022, 3:15 AM
Hi guys, I have a question about removing metadata from DataHub. I ingested BigQuery tables into DataHub and then deleted one of the ingested tables from BigQuery. When I re-ingested the BigQuery tables into DataHub, I expected the table to be deleted from DataHub, but it was still there. Why is the table still there? Do I have to delete it manually?
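For context, re-running a recipe does not remove entities by default; soft-deleting tables that disappeared from the source requires stateful ingestion. A hedged sketch (assumes a datahub-rest sink and a stable pipeline_name, which stateful ingestion requires; the project and server values are placeholders):
pipeline_name: bigquery_prod_ingestion    # must stay the same across runs
source:
  type: bigquery
  config:
    project_id: my-project
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true    # soft-deletes entities missing from the latest run
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'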
    m
    • 2
    • 1
  • n

    numerous-morning-88512

    03/31/2022, 8:52 AM
Hi guys, I have a problem with Metabase ingestion: it runs successfully, but I can't find the metadata in DataHub.
    c
    • 2
    • 3
  • s

    shy-fireman-88724

    03/31/2022, 8:27 PM
Hello, we're facing some problems with the Spark jobs integration. We're creating a Scala notebook whose job reads from a Hive table using
    spark.sql()
and writes the data into another Hive table. Although the lineage appears, it has the wrong names in the components: the source shows the S3 location and the Spark job shows the method name, as you can see in the image below. We expected
    schema_name.table_name
to appear instead of the S3 location. Is there anything more we can configure? Another question: is the demo source code available somewhere?
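For reference, the Spark integration discussed here is configured through the DataHub Spark listener; a rough sketch of the relevant properties (the version number and GMS URL are placeholders, set e.g. in spark-defaults.conf or the notebook's Spark config):
spark.jars.packages        io.acryl:datahub-spark-lineage:0.8.31
spark.extraListeners       datahub.spark.DatahubSparkListener
spark.datahub.rest.server  http://localhost:8080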
    l
    i
    c
    • 4
    • 13
  • c

    cold-hydrogen-10513

    04/01/2022, 10:13 AM
hello, I posted a question earlier regarding connecting to Snowflake, and @incalculable-ocean-74010 helped me specify the correct endpoint for my recipe: https://datahubspace.slack.com/archives/CUMUWQU66/p1648554749148879 Now I manage to connect to Snowflake, but I have an issue getting the metadata. The ingestion job runs for about 1.5 h and then fails. The stack trace is below (I trimmed it a bit). Could you please give me a hint about what to check here?
    '[2022-03-30 16:24:11,842] ERROR    {datahub.entrypoints:152} - File '
               '"/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/datahub/entrypoints.py", line 138, in main\n'
               '    135  def main(**kwargs):\n'
               '    136      # This wrapper prevents click from suppressing errors.\n'
               '    137      try:\n'
               '--> 138          sys.exit(datahub(standalone_mode=False, **kwargs))\n'
               '    139      except click.exceptions.Abort:\n'
               '    ..................................................\n'
               '     kwargs = {}\n'
               '     datahub = <Group datahub>\n'
               "     click.exceptions.Abort = <class 'click.exceptions.Abort'>\n"
               '    ..................................................\n'
               '\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/click/core.py", line 1130, in __call__\n'
               '    1128  def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:\n'
               ' (...)\n'
               '--> 1130      return self.main(*args, **kwargs)\n'
               '    ..................................................\n'
               '     self = <Group datahub>\n'
               '     args = ()\n'
               '     t.Any = typing.Any\n'
               "     kwargs = {'standalone_mode': False,\n"
               "               'prog_name': 'python3 -m datahub'}\n"
               '    ..................................................\n'
               '\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/click/core.py", line 1055, in main\n'
               '    rv = self.invoke(ctx)\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/click/core.py", line 1657, in invoke\n'
               '    return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
    ........................
    'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/snowflake/sqlalchemy/snowdialect.py", '
               'line 573, in <listcomp>\n'
               '    return [self.normalize_name(row[1]) for row in cursor]\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/snowflake/sqlalchemy/snowdialect.py", '
               'line 204, in normalize_name\n'
               '    if name.upper() == name and not self.identifier_preparer._requires_quotes(name.lower()):\n'
               'File "/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/sqlalchemy/sql/compiler.py", line 3613, '
               'in _requires_quotes\n'
               '    or value[0] in self.illegal_initial_characters\n'
               '\n'
               'IndexError: string index out of range\n'
               '[2022-03-30 16:24:11,842] INFO     {datahub.entrypoints:161} - DataHub CLI version: 0.8.31 at '
               '/tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/lib/python3.9/site-packages/datahub/__init__.py\n'
               '[2022-03-30 16:24:11,842] INFO     {datahub.entrypoints:164} - Python version: 3.9.9 (main, Dec 21 2021, 10:03:34) \n'
               '[GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-a8e48815-7f1f-4468-958c-3c2b1fcbf48e/bin/python3 on '
               'Linux-5.4.176-91.338.amzn2.x86_64-x86_64-with-glibc2.31\n'
               "[2022-03-30 16:24:11,842] INFO     {datahub.entrypoints:167} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': "
               "'v0.8.31', 'commit': '2f078c981c86b72145eebf621230ffd445948ef6'}}, 'managedIngestion': {'defaultCliVersion': '0.8.31', 'enabled': True}, "
               "'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, "
               "'retention': 'true', 'noCode': 'true'}\n",
               "2022-03-30 16:24:14.167925 [exec_id=a8e48815-7f1f-4468-958c-3c2b1fcbf48e] INFO: Failed to execute 'datahub ingest'",
               '2022-03-30 16:24:14.168306 [exec_id=a8e48815-7f1f-4468-958c-3c2b1fcbf48e] INFO: Caught exception EXECUTING '
               'task_id=a8e48815-7f1f-4468-958c-3c2b1fcbf48e, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 119, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 81, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    s
    • 2
    • 112
  • c

    chilly-oil-22683

    04/01/2022, 11:45 AM
Hi, trying to set up ingestion from an Athena source: https://datahubproject.io/docs/metadata-ingestion/source_docs/athena#config-details But I'm trying to wrap my head around the concept of a schema in the context of Athena. AFAIK Athena doesn't have the concept of a schema (perhaps under the hood), but as users we only deal with the concepts
    data catalog
    ,
    database
    ,
    table
    and
    view
    . So what do you mean by
    schema
    here?
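For context, in the SQLAlchemy-based Athena source the "schema" concept generally maps to what Athena calls a database; a rough recipe sketch (region, work group, staging dir, and database are placeholders, and field names may differ slightly by version):
source:
  type: athena
  config:
    aws_region: eu-west-1
    work_group: primary
    s3_staging_dir: 's3://my-athena-query-results/'
    database: my_database        # the Athena database, i.e. the "schema"
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'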
    d
    • 2
    • 1
  • c

    chilly-oil-22683

    04/02/2022, 9:26 AM
Hi, Business Glossary ingestion doesn't seem to support an S3 URI as the file path? That seems more logical to me than having to push the glossary file to the EKS cluster or the pod's file storage. Mature file libraries should be able to parse common URIs by themselves, I think. Am I overlooking something? What URIs does this ingestor support besides local files?
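For reference, the glossary ingestion recipe points at a file path; a minimal sketch (the path is a placeholder, and whether remote URIs such as s3:// are accepted depends on the version):
source:
  type: datahub-business-glossary
  config:
    file: ./business_glossary.yml
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'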
    b
    • 2
    • 1
  • s

    swift-breakfast-25077

    04/02/2022, 11:22 AM
hello everyone, I'm currently testing metadata ingestion and I noticed that it takes all tables. Is there a way to specify only the tables that should be considered by the ingestion? Another question: how can we ingest Power BI reports that are deployed on a remote server (not on my local machine)?
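For context, most SQL-based sources accept allow/deny regex patterns to restrict what gets ingested; a hedged sketch against a generic source (the source type and all names are placeholders):
source:
  type: <your-source-type>
  config:
    # ...connection settings...
    schema_pattern:
      allow:
        - '^sales$'
    table_pattern:
      allow:
        - '^mydb\.sales\.orders$'     # anchored, fully qualified names
      deny:
        - '.*_tmp$'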
    i
    • 2
    • 4
  • h

    handsome-minister-84652

    04/03/2022, 11:19 PM
Hi team - quick question: is there an API to create a chart? I see one to update a chart, but I'd like to create one (our charts are kept in a YAML file, just like our domains; we parse the file and update DataHub during our build).
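For reference, a chart can be created the same way other entities are pushed: by emitting its chartInfo aspect with the Python emitter. A rough sketch, not an official "create chart" endpoint; the URN, field values, and GMS address are assumptions:
# Hypothetical sketch: create a chart by emitting its chartInfo aspect.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeAuditStampsClass,
    ChangeTypeClass,
    ChartInfoClass,
)

# Chart URN and GMS address are placeholders.
chart_urn = "urn:li:chart:(looker,my_build_generated_chart)"
now = AuditStampClass(time=0, actor="urn:li:corpuser:build_bot")

chart_info = ChartInfoClass(
    title="Orders per day",
    description="Created from our charts.yaml during the build",
    lastModified=ChangeAuditStampsClass(created=now, lastModified=now),
)

mcp = MetadataChangeProposalWrapper(
    entityType="chart",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=chart_urn,
    aspectName="chartInfo",
    aspect=chart_info,
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)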
    i
    • 2
    • 3
  • m

    mammoth-fountain-32989

    04/04/2022, 9:43 AM
Hi, I want to load metadata (through the DataHub ingestion UI) from certain schemas, and only for tables with a specific pattern in their names (PostgreSQL source). My YAML looks similar to this:
schema_pattern:
  allow:
    - abc
    - pqr
    - test
table_pattern:
  allow:
    - test_base_tbl
    - check_validations
    - user_info
I assumed it would apply a logical AND of these schema and table patterns, but I see that all objects from the given schemas (irrespective of the object name pattern) are being ingested. Also, is there a way to restrict which views we ingest? (I am using include_views: True, which is pulling in all the views.) Any sample of how to provide schema and table regex patterns that can be used in conjunction would help. Thanks
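For reference, the patterns are regular expressions, and table_pattern is typically matched against the fully qualified name (database.schema.table), so short, unanchored entries match more than intended. A sketch combining both patterns (the schema and table names come from the question; the anchoring is the suggestion, and view_pattern availability may vary by version):
schema_pattern:
  allow:
    - '^abc$'
    - '^pqr$'
    - '^test$'
table_pattern:
  allow:
    - '^.*\.test_base_tbl$'
    - '^.*\.check_validations$'
    - '^.*\.user_info$'
include_views: false      # or keep views and restrict them with view_pattern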
    d
    • 2
    • 10
  • m

    most-waiter-95820

    04/04/2022, 10:56 AM
Hi all, a BigQuery lineage question. Say we have this flow of tables and views:
TABLE_A --> VIEW_A --> VIEW_B
VIEW_B is based on VIEW_A, which in turn is built on top of TABLE_A. I've tried to run an ingestion with lineage from BigQuery exported audit logs and it returned a graph like:
TABLE_A --> VIEW_A
        ┕--> VIEW_B
Is there a way to build the lineage in a more "chained" way, to be able to track the order of views built on top of each other?
    s
    • 2
    • 9
  • f

    few-grass-66826

    04/04/2022, 3:03 PM
Hi guys, I am trying to ingest Snowflake usage from the UI with this config, but it returns empty results. Is there something wrong, or is my config incorrect? Thanks :)
source:
  type: snowflake-usage
  config:
    host_port: ######.eu-west-1
    warehouse: COMPUTE_WH
    username: ########
    password: ##########
    role: ACCOUNTADMIN
    env: prod
    top_n_queries: 10
    email_domain: #########.com
    schema_pattern:
      deny:
        - 'information_schema.*'
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms:8080'
    d
    • 2
    • 8
  • q

    quaint-window-7517

    04/05/2022, 6:12 AM
Hello guys, I am having a problem with the ingestion UI. I have deployed DataHub to AWS EKS and switched to AWS RDS instead of the default MySQL pod. After that, I see these errors in the UI and I can't create new sources: Unknown error 500
    d
    e
    +2
    • 5
    • 21
  • b

    brave-market-65632

    04/05/2022, 6:41 AM
Team, I downloaded the quickstart docker image(s) and have been playing around with ingestion to start with. The focus is to scan a Snowflake DB. A Snowflake scan with
    include_table_lineage: False
works fine. However, when it is set to `True`, the ingestion logs report the following error.
    [2022-04-05 11:55:31,989] WARNING  {snowflake.connector.vendored.urllib3.connectionpool:780} - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))':
When I inspect the logs in the
    datahub_datahub-actions_1
    container, I see the following error.
    [2022-04-05 06:22:33,752] ERROR    {acryl_action_fwk.source.datahub_streaming:279} - ERROR
    
    Traceback (most recent call last):
    
      File "/usr/local/lib/python3.9/site-packages/acryl_action_fwk/source/datahub_streaming.py", line 268, in _handle_mae
    
        match=a.subscriptions()[0],
    
    IndexError: list index out of range
In the Snowflake query history, I can see that the metadata lineage query ran fine and returned ~495K records. Any help is appreciated. Thanks.
    l
    s
    b
    • 4
    • 15
  • n

    nutritious-bird-45843

    04/05/2022, 6:49 PM
Hello, guys! We intend to upgrade our platform to the newest DataHub version,
    v0.8.32
, so we are doing some local tests before deploying to the stage and production environments. However, we are facing some issues regarding data indexing. For instance, Kafka topics appear under
    Datasets
and also show up if we query using the search bar. However, when clicking on the Kafka connector, we receive
    No results found for ""
as if the query were searching for nothing. The same occurs for the Hive connector. The first image shows the home page and the second shows the issue that happens after clicking on Kafka. One thing worth mentioning is that we currently have the same Kafka and Hive metadata ingested in other envs on DataHub version
    28
    and they retrieve the metadata when we click on the connectors. Thanks in advance!
    l
    b
    • 3
    • 6
  • p

    plain-farmer-27314

    04/05/2022, 6:58 PM
Hey all, just wondering if there is a built-in way to add links to a dataset during ingestion. I assume a transformer would be the correct place to do this?
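For context, if no ready-made transformer covers it, links live in the institutionalMemory aspect, which can also be emitted directly; a hedged Python sketch (the dataset URN, link URL, and GMS address are placeholders):
# Hypothetical sketch: attach a documentation link to a dataset.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeTypeClass,
    InstitutionalMemoryClass,
    InstitutionalMemoryMetadataClass,
)

dataset_urn = make_dataset_urn(platform="snowflake", name="db.schema.table", env="PROD")

links = InstitutionalMemoryClass(
    elements=[
        InstitutionalMemoryMetadataClass(
            url="https://wiki.example.com/my-dataset",            # placeholder
            description="Runbook / documentation",
            createStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
        )
    ]
)

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=dataset_urn,
    aspectName="institutionalMemory",
    aspect=links,
)

DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)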
    d
    • 2
    • 1
  • p

    plain-baker-30549

    04/06/2022, 6:18 AM
Hi team, I'm currently evaluating DataHub as a data catalog and looking at how to ingest metadata from Snowflake. Pulling metadata using the Snowflake recipe works fine, but I wonder whether it's also possible (and how) to push data (e.g. from the INFORMATION_SCHEMA of a Snowflake DB) to DataHub.
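For context, the pull recipes and the push path use the same underlying APIs: anything that can build aspects can emit them to GMS over REST. A rough Python sketch of pushing a dataset description (the URN, description, and server are placeholders), which could just as well be driven by rows read from INFORMATION_SCHEMA:
# Hypothetical sketch: push metadata to DataHub instead of pulling it.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")

# In a real script, these values would come from INFORMATION_SCHEMA queries.
dataset_urn = make_dataset_urn(platform="snowflake", name="mydb.public.orders", env="PROD")
properties = DatasetPropertiesClass(
    description="Orders table, pushed from a custom script",
    customProperties={"source": "information_schema"},
)

emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dataset_urn,
        aspectName="datasetProperties",
        aspect=properties,
    )
)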
    d
    • 2
    • 4
  • n

    nutritious-bird-77396

    04/06/2022, 2:42 PM
    `A Debug Question`: How do you run
    datahub ingest -c <recipe.yaml>
in debug mode locally? Is there an additional option you pass on the command line?
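For reference, the CLI has a global debug flag that turns on verbose logging, which is probably the simplest option here (assuming a recent 0.8.x CLI):
datahub --debug ingest -c <recipe.yaml>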
    s
    p
    • 3
    • 10
  • p

    plain-farmer-27314

    04/06/2022, 2:45 PM
    Hi, wondering what source is being used to ingest the mlmodels here: https://demo.datahubproject.io/browse/mlModels
    l
    • 2
    • 1
  • b

    billowy-flag-4217

    04/06/2022, 3:58 PM
    Hello, I'm currently using
    acryl-datahub=0.8.31.4
    and
    python=3.8
; when attempting to ingest Looker metadata, I get the following error.
    TypeError: You should use `typing_extensions.TypedDict` instead of `typing.TypedDict` with Python < 3.9.2. Without it, there is no way to differentiate required and optional fields when subclassed.
Is it now a requirement to use Python 3.9.2 for Looker ingestion, or is there another workaround?
    • 1
    • 1
  • m

    mysterious-lamp-91034

    04/06/2022, 7:46 PM
Hello, I am running
    ./gradlew :metadata-ingestion:testQuick
on v0.8.32, and I am seeing
    =========================== short test summary info ============================
    FAILED tests/integration/looker/test_looker.py::test_looker_ingest - TypeErro...
    FAILED tests/integration/looker/test_looker.py::test_looker_ingest_allow_pattern
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest - TypeErro...
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest_offline - ...
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest_offline_platform_instance
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest_api_bigquery
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_ingest_api_hive
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_bad_sql_parser - ...
    FAILED tests/integration/lookml/test_lookml.py::test_lookml_github_info - Typ...
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_no_partition.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_no_partition_exclude.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_no_partition_filename.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_no_partition_glob.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_partition_basic.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[folder_partition_keyval.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[multiple_files.json]
    FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[single_file.json]
    ==== 17 failed, 325 passed, 52 deselected, 30 warnings in 60.13s (0:01:00) =====
    g
    d
    s
    • 4
    • 7
  • b

    billions-twilight-48559

    04/06/2022, 9:11 PM
Hi! First of all, I'm not new to DataHub. I'm trying a new, empty setup of DataHub for testing on version 0.8.31. We are successfully executing our ingestion recipes for glossary terms using the DataHub CLI; everything is OK at the CLI, and I can see a 200 response status in the GMS logs for each insert. But no content appears on the frontend!
    l
    e
    c
    • 4
    • 17
  • o

    orange-coat-2879

    04/06/2022, 10:01 PM
Hi! I am working on ingestion from MSSQL (SQLEXPRESS). The
    localhost:1433
does not work for me. I have attached my recipe here. Can anyone help? I am not sure whether I should put a real URL (http://........) in host_port. Thanks for helping!
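For reference, host_port expects hostname:port rather than an http:// URL; for a local SQL Express instance, TCP/IP also usually has to be enabled in SQL Server Configuration Manager before port 1433 is reachable. A minimal sketch (database and credentials are placeholders):
source:
  type: mssql
  config:
    host_port: localhost:1433      # hostname:port, no http:// scheme
    database: my_database
    username: datahub_reader
    password: '${MSSQL_PASSWORD}'
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'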
    b
    b
    • 3
    • 6
  • t

    thousands-room-91010

    04/07/2022, 3:10 AM
We’d like to add the ability to add entities via GraphQL on a local deployment, in advance of the feature coming out soon: https://datahubproject.io/docs/api/graphql/querying-entities/ Can anyone provide tips on what would be involved to accomplish this? Is it as simple as creating a new GraphQL mutation that adds a new entry in the internal database? Thanks for your help
    m
    h
    +3
    • 6
    • 21
  • a

    able-rain-74449

    04/07/2022, 2:12 PM
Hi all, I am getting an error while running
datahub ingest -c example_to_datahub_kafka.yml --dry-run
(see thread):
ConfigurationError: datahub-kafka is disabled; try running: pip install 'acryl-datahub[datahub-kafka]'
I tried running
pip install 'acryl-datahub[datahub-kafka]'
and get
    88a6420d827a/src/confluent_kafka/src/confluent_kafka.h:23:10: fatal error: 'librdkafka/rdkafka.h' file not found
          #include <librdkafka/rdkafka.h>
                   ^~~~~~~~~~~~~~~~~~~~~~
          1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> confluent-kafka
    
    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure.
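For context, this failure is confluent-kafka building from source without the librdkafka C headers; a common fix is to install librdkafka first and then retry (the package manager commands below are assumptions about the build host):
# macOS (Homebrew)
brew install librdkafka
# Debian/Ubuntu
sudo apt-get install -y librdkafka-dev
# then retry
pip install 'acryl-datahub[datahub-kafka]'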
    d
    k
    • 3
    • 12
  • a

    able-rain-74449

    04/07/2022, 4:02 PM
Hi All, does anyone have an example of ingesting data into DataHub on EKS? What's the best approach?
    h
    s
    s
    • 4
    • 11
  • b

    brave-forest-5974

    04/07/2022, 5:13 PM
🤔 In Looker, should I expect to see explores connected to their views (after both LookML and dashboard ingestion)? Or, a better question: what would prevent an explore from connecting to its views? Joins, perhaps?
    l
    m
    • 3
    • 11
  • h

    handsome-football-66174

    04/07/2022, 6:47 PM
Hi team, we want to ingest data coming in real time. How do we get started? I understand these are the steps we need to take: 1. convert the metadata into Pegasus format, 2. create an emitter. Is there anything else I need to consider?
    d
    • 2
    • 1
  • n

    numerous-eve-42142

    04/07/2022, 7:48 PM
Hello! I'm trying to ingest some specific tables from our Redshift DB into DataHub, but there are tables whose names are similar to others. For example, the table "db.schema.carrier" has to be ingested, so we configured:
    table_pattern:
         allow:
            - "db.schema.carrier"
    
    ...
    
    profile_pattern:
          allow:
            - "db.schema.carrier"`
But this example ingests the tables: • carrier • carrier_damage • carrieir_tower_team Is there some way to strictly specify only the tables I want?
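For reference, the allow entries are regular expressions matched against the fully qualified name, and an unanchored pattern like db.schema.carrier also matches longer names that contain it. Anchoring the pattern should restrict ingestion to the exact table (a sketch):
table_pattern:
  allow:
    - '^db\.schema\.carrier$'
profile_pattern:
  allow:
    - '^db\.schema\.carrier$'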
    h
    • 2
    • 8
  • i

    icy-piano-35127

    04/07/2022, 8:05 PM
Hello guys! Is it possible to filter the Redash information to bring in just the queries and dashboards from a specific data source?
    d
    • 2
    • 3