# ingestion
  • wooden-jackal-88380
    09/14/2022, 11:06 AM
    Hey, I am using the latest version of DataHub with the snowflake-beta and dbt recipes. I turned on profiling in the Snowflake recipe. When I go to the combined Snowflake + dbt dataset, it does not show the profiling information; it only shows it when I go to the standalone table sub-type. Is this expected behaviour, and is there a way to fix this?
    ✅ 1
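    For context, a minimal sketch of the setup being described, with profiling enabled in the snowflake-beta source (the account, credentials, and warehouse values are placeholders):
    source:
      type: snowflake-beta
      config:
        account_id: "example_account"   # placeholder
        username: "example_user"        # placeholder
        password: "example_password"    # placeholder
        warehouse: "EXAMPLE_WH"         # placeholder
        profiling:
          enabled: true                 # profiles are attached to the Snowflake table entities
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"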
  • square-bird-94136
    09/14/2022, 12:48 PM
    Hello, I'm trying to ingest data from BigQuery into my local version of DataHub. The script runs successfully, but no info about tables is loaded; in DataHub I just see a BigQuery folder with my project name. The recipe is straightforward (I removed sensitive data):
    source:
      type: "bigquery"
      config:
        project_id: ""
        credential:
          project_id: ""
          private_key_id: ""
          private_key: ""
          client_email: ""
          client_id: ""
        profiling:
          enabled: false
        include_table_lineage: false
        start_time: "2020-03-01T00:00:00Z"
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    Script output:
    Source (bigquery) report:
    {'entities_profiled': '0',
     'event_ids': ['container-info...',
                   'container-platforminstance...',
                   'container-subtypes-...'],
     'events_produced': '3',
     'events_produced_per_sec': '2',
     'failures': {},
     'filtered': [],
     'include_table_lineage': 'False',
     'invalid_partition_ids': {},
     'log_page_size': '1000',
     'partition_info': {},
     'profile_table_selection_criteria': {},
     'running_time': '1.26 seconds',
     'selected_profile_tables': {},
     'soft_deleted_stale_entities': [],
     'start_time': '2022-09-14 14:47:16.617896 (1.26 seconds ago).',
     'table_metadata': {},
     'tables_scanned': '0',
     'upstream_lineage': {},
     'use_date_sharded_audit_log_tables': 'False',
     'use_exported_bigquery_audit_metadata': 'False',
     'use_v2_audit_metadata': 'False',
     'views_scanned': '0',
     'warnings': {},
     'window_end_time': '2022-09-14 12:47:16.343255+00:00 (1.53 seconds ago).',
     'window_start_time': '2020-03-01 00:00:00+00:00 (2 years, 28 weeks and 3 days ago).'}
    Sink (datahub-rest) report:
    {'current_time': '2022-09-14 14:47:17.875747 (now).',
     'failures': [],
     'gms_version': 'v0.8.44',
     'pending_requests': '0',
     'records_written_per_second': '0',
     'start_time': '2022-09-14 14:46:55.391435 (22.48 seconds ago).',
     'total_duration_in_seconds': '22.48',
     'total_records_written': '3',
     'warnings': []}
    ✅ 1
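    For context, the report above shows only three container events and zero tables scanned, so nothing table-level was emitted. One way to rule out filtering as the cause is to set explicit allow/deny patterns on the bigquery source; a hedged sketch, with hypothetical dataset and table names:
    source:
      type: bigquery
      config:
        project_id: "my-project"    # placeholder
        schema_pattern:             # BigQuery datasets are treated as schemas here
          allow:
            - "my_dataset"          # hypothetical dataset name
        table_pattern:
          deny:
            - ".*_tmp$"             # hypothetical exclusion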
  • quiet-smartphone-60119
    09/14/2022, 2:53 PM
    Hi folks, with version 0.8.44 - now that CLI ingestions are visible in the Ingestion tab - is there a way to assign a custom name to an ingestion recipe, similar to the recipes created through the DataHub UI?
  • salmon-angle-92685
    09/14/2022, 3:15 PM
    Hello, is there Stateful Ingestion for S3 sources? Thank you in advance!
  • billions-zebra-46597
    09/14/2022, 6:04 PM
    Are there any example recipes for ingesting from MSSQL with integrated security / Windows auth?
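    Not an official example, but a hedged sketch of what integrated security can look like via the mssql source's ODBC mode; use_odbc and uri_args are the source's pass-through options, while Trusted_Connection is the generic ODBC flag for Windows auth and is an assumption here:
    source:
      type: mssql
      config:
        host_port: "sqlserver.example.com:1433"    # hypothetical host
        database: "MyDatabase"                     # hypothetical database
        use_odbc: true
        uri_args:
          driver: "ODBC Driver 17 for SQL Server"
          Trusted_Connection: "yes"                # assumption: requests Windows integrated auth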
  • quiet-school-18370
    09/14/2022, 7:10 PM
    Hi, is it possible to integrate DataHub with dbt without installing dbt?
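    For context, the dbt source reads dbt's generated artifact files rather than invoking dbt itself, so dbt does not need to be installed where the ingestion runs; a hedged sketch, with hypothetical file locations:
    source:
      type: dbt
      config:
        manifest_path: "/path/to/manifest.json"   # artifacts produced by a dbt run elsewhere
        catalog_path: "/path/to/catalog.json"
        sources_path: "/path/to/sources.json"     # optional
        target_platform: "snowflake"              # assumption: the warehouse dbt runs against
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"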
  • able-evening-90828
    09/14/2022, 7:15 PM
    I have a question for MySQL ingestion. If I set up two ingestions from two different MySQL instances, is there any way for me to tell which MySQL instance some dataset is ingested from? Also if the two MySQL instances have some identical table names under identical database names, will they overwrite each other when both are ingested?
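    A hedged sketch of how this is commonly handled: giving each source a platform_instance makes the instance name part of the dataset URN, so identical database/table names from the two MySQL servers do not collapse into one entity (instance names below are made up):
    source:
      type: mysql
      config:
        host_port: "mysql-a.example.com:3306"   # hypothetical host for the first instance
        platform_instance: "instance_a"         # distinguishes this MySQL server in URNs
    The second recipe would use a different platform_instance (e.g. instance_b); without one, matching database.table names from both servers resolve to the same URN and overwrite each other.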
  • cool-boots-36947
    09/14/2022, 7:39 PM
    Hi. When we try to ingest a Snowflake database, the ingestion fails:
    'ProgrammingError: (snowflake.connector.errors.ProgrammingError) 090105 (22000): Cannot perform SELECT. This session does not have a '
               "current database. Call 'USE DATABASE', or use a qualified name.\n"
               '[SQL: \n'
               'select table_catalog, table_schema, table_name\n'
               'from information_schema.tables\n'
               "where last_altered >= to_timestamp_ltz(1663086530849, 3) and table_type= 'BASE TABLE'\n"
               '            ]\n'
               '(Background on this error at: <http://sqlalche.me/e/13/f405>)\n'
               '[2022-09-14 16:28:52,024] INFO     {datahub.entrypoints:187} - DataHub CLI version: 0.8.41 at '
               '/tmp/datahub/ingest/venv-snowflake-0.8.41/lib/python3.9/site-packages/datahub/__init__.py\n'
               '[2022-09-14 16:28:52,024] INFO     {datahub.entrypoints:190} - Python version: 3.9.9 (main, Dec 21 2021, 10:03:34) \n'
               '[GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-snowflake-0.8.41/bin/python3 on '
               'Linux-5.4.196-108.356.amzn2.x86_64-x86_64-with-glibc2.31\n'
               "[2022-09-14 16:28:52,024] INFO     {datahub.entrypoints:193} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': "
               "'v0.8.42', 'commit': '4f35a6c43dcd058e4e85b1ed7e4818100ab224e0'}}, 'managedIngestion': {'defaultCliVersion': '0.8.41', 'enabled': True}, "
               "'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, "
               "'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'prod'}, 'noCode': 'true'}\n",
               "2022-09-14 16:28:53.137401 [exec_id=2dc5382a-f673-489f-b9bf-4cf1328b7bf7] INFO: Failed to execute 'datahub ingest'",
               '2022-09-14 16:28:53.137719 [exec_id=2dc5382a-f673-489f-b9bf-4cf1328b7bf7] INFO: Caught exception EXECUTING '
               'task_id=2dc5382a-f673-489f-b9bf-4cf1328b7bf7, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 112, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
    Here is the recipe:
    source:
      type: snowflake
      config:
        username: xxxxx
        password: xxxxx
        role: xxxx
        warehouse: xxxxx
        check_role_grants: true
        account_id: xxxxx
        include_table_lineage: true
        include_view_lineage: true
        ignore_start_time_lineage: true
        upstream_lineage_in_report: true
        profiling:
          enabled: true
        stateful_ingestion:
          enabled: true
        database_pattern:
          allow:
            - SNOWFLAKE
        schema_pattern:
          allow:
            - ACCOUNT_USAGE
    pipeline_name: 'urn:li:dataHubIngestionSource:xxxxxxxxxxxxxxxxxxxxxxxxxxx'
  • salmon-angle-92685
    09/14/2022, 2:38 PM
    Hello guys, is there a way of using Stateful Ingestion with schedules? I've tried it for Snowflake and it works, but I cannot relaunch the pipeline via the UI or set a cron schedule expression. Thank you in advance!
  • proud-table-38689
    09/14/2022, 8:38 PM
    How can I debug this error when trying to read from Postgres? We're trying to ingest different schemas from the same Postgres database we're using for DataHub itself.
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '194692c4-b85b-4915-afc4-f1ef0f7b7a1b',
     'infos': ['2022-09-14 20:36:52.270206 [exec_id=194692c4-b85b-4915-afc4-f1ef0f7b7a1b] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-14 20:36:52.270619 [exec_id=194692c4-b85b-4915-afc4-f1ef0f7b7a1b] INFO: Caught exception EXECUTING '
               'task_id=194692c4-b85b-4915-afc4-f1ef0f7b7a1b, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
               '    validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
               '  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
               '  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
               'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
               'debug_mode\n'
               '  extra fields not permitted (type=value_error.extra)\n']}
    Execution finished with errors.
    ✅ 1
  • rough-activity-61346
    09/15/2022, 2:12 AM
    I have uploaded artifacts (manifest, catalog, sources, results) from dbt Cloud to Google Cloud Storage through Airflow. Is there any way to access each file in Google Cloud Storage from dbt ingestion on DataHub?
  • gifted-knife-16120
    09/15/2022, 4:03 AM
    Hey all, may I know if there is any way to mask the value here? We have multiple columns that store personal information.
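    If "the value here" refers to profiler sample values, a hedged sketch of switching them off (the option below exists in the GE-based profiling config; whether it covers the exact field in the screenshot is an assumption):
    source:
      type: snowflake    # assumption: whichever source is being profiled
      config:
        profiling:
          enabled: true
          include_field_sample_values: false   # stop emitting per-column sample values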
  • proud-table-38689
    09/15/2022, 5:11 AM
    is there some sort of directory of third-party custom ingestion sources?
  • limited-forest-73733
    09/15/2022, 7:55 AM
    Hey team! I am facing an error while doing ingestion using datahub-kafka. The ingestion image version is 0.8.41. Can someone please help me out?
  • better-dinner-64431
    09/15/2022, 9:12 AM
    Team, I am able to ingest the Databricks tables using the databricks+pyhive scheme; however, after ingestion I do not see lineage for any tables. Is this expected behavior?
  • salmon-angle-92685
    09/15/2022, 9:49 AM
    Hello guys, we have redshift-usage and redshift. Can we use both when ingesting in order to have both stats and table metadata, or is it only one or the other? Thank you in advance!
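    For context, the two sources are complementary rather than alternatives: redshift emits the table/schema metadata while redshift-usage emits usage statistics, and they are typically run as two separate recipes against the same cluster. A hedged sketch with placeholder connection details (sinks omitted):
    # recipe 1: table metadata
    source:
      type: redshift
      config:
        host_port: "my-cluster.example.com:5439"   # placeholder
        database: "dev"                            # placeholder
        username: "datahub_user"                   # placeholder
        password: "example_password"               # placeholder
    # recipe 2: usage statistics
    source:
      type: redshift-usage
      config:
        host_port: "my-cluster.example.com:5439"
        database: "dev"
        username: "datahub_user"
        password: "example_password"
        email_domain: "example.com"                # assumption: used to map query users to DataHub users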
  • salmon-angle-92685
    09/15/2022, 11:56 AM
    Hello guys, is the idea of snowflake-beta to gather the features of both snowflake and snowflake-usage? If I use the two combined, is that the same as using the -beta one? I am asking because, since the latter is a "beta" version, I am not so sure about rolling it out at the company already. Thanks!
  • square-bird-94136
    09/15/2022, 12:27 PM
    Hello, I'm trying to use UI ingestion and got the following error:
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': 'fe59f987-686e-4078-8f83-eb1ddf63fc2f',
     'infos': ['2022-09-15 12:22:12.253433 [exec_id=fe59f987-686e-4078-8f83-eb1ddf63fc2f] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-15 12:22:48.478154 [exec_id=fe59f987-686e-4078-8f83-eb1ddf63fc2f] INFO: stdout=venv setup time = 0\n'
               'This version of datahub supports report-to functionality\n'
               'datahub  ingest run -c /tmp/datahub/ingest/fe59f987-686e-4078-8f83-eb1ddf63fc2f/recipe.yml --report-to '
               '/tmp/datahub/ingest/fe59f987-686e-4078-8f83-eb1ddf63fc2f/ingestion_report.json\n'
               '[2022-09-15 12:22:34,221] INFO     {datahub.cli.ingest_cli:182} - DataHub CLI version: 0.8.44.2\n'
               '[2022-09-15 12:22:34,243] INFO     {datahub.ingestion.run.pipeline:175} - Sink configured successfully. DataHubRestEmitter: configured '
               'to talk to <http://datahub-datahub-gms:8080>\n'
               '[2022-09-15 12:22:46,903] ERROR    {datahub.entrypoints:192} - \n'
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 196, in __init__\n'
               '    self.source: Source = source_class.create(\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/sql/bigquery.py", line 989, in create\n'
               '    config = BigQueryConfig.parse_obj(config_dict)\n'
               '  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source_config/sql/bigquery.py", line 69, in __init__\n'
               '    super().__init__(**data)\n'
               '  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
               'pydantic.error_wrappers.ValidationError: 1 validation error for BigQueryConfig\n'
               'include_view_lineage\n'
               '  extra fields not permitted (type=value_error.extra)\n'
               '\n'
               'The above exception was the direct cause of the following exception:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 197, in run\n'
               '    pipeline = Pipeline.create(\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 317, in create\n'
               '    return cls(\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 202, in __init__\n'
               '    self._record_initialization_failure(\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 129, in _record_initialization_failure\n'
               '    raise PipelineInitError(msg) from e\n'
               'datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (bigquery)\n'
               '[2022-09-15 12:22:46,903] ERROR    {datahub.entrypoints:195} - Command failed: \n'
               '\tFailed to configure source (bigquery) due to \n'
               "\t\t'1 validation error for BigQueryConfig\n"
               'include_view_lineage\n'
               "  extra fields not permitted (type=value_error.extra)'.\n"
               '\tRun with --debug to get full stacktrace.\n'
               "\te.g. 'datahub --debug ingest run -c /tmp/datahub/ingest/fe59f987-686e-4078-8f83-eb1ddf63fc2f/recipe.yml --report-to "
               "/tmp/datahub/ingest/fe59f987-686e-4078-8f83-eb1ddf63fc2f/ingestion_report.json'\n",
               "2022-09-15 12:22:48.478380 [exec_id=fe59f987-686e-4078-8f83-eb1ddf63fc2f] INFO: Failed to execute 'datahub ingest'",
               '2022-09-15 12:22:48.478596 [exec_id=fe59f987-686e-4078-8f83-eb1ddf63fc2f] INFO: Caught exception EXECUTING '
               'task_id=fe59f987-686e-4078-8f83-eb1ddf63fc2f, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
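    The validation error itself names the cause: on this version, BigQueryConfig rejects include_view_lineage as an unknown field. A hedged sketch of a recipe that should pass validation, with that key removed (project id and server are placeholders):
    source:
      type: bigquery
      config:
        project_id: "my-project"      # placeholder
        include_table_lineage: true   # accepted by this source, unlike include_view_lineage
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-datahub-gms:8080"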
  • alert-fall-82501
    09/15/2022, 1:43 PM
    Hi team - I have just a quick question: what will happen if I stop a metadata ingestion in between? I mean, is there any loss of data? I am ingesting metadata from a Hive source.
  • microscopic-table-21578
    09/15/2022, 2:03 PM
    Is there support for indexing Snowsight dashboards in DH?
  • bumpy-journalist-41369
    09/15/2022, 3:08 PM
    Can someone give a sample recipe.yaml where you specify both paths to include and to exclude? I am trying to ingest data from an S3 data lake.
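    A hedged sketch of what that can look like with the s3 source's path_specs, where each spec takes a single include template plus an optional list of exclude globs (bucket, folder names, and region are hypothetical):
    source:
      type: s3
      config:
        path_specs:
          - include: "s3://my-bucket/data/{table}/*.parquet"   # hypothetical layout
            exclude:
              - "**/_tmp/**"                                   # hypothetical exclusions
              - "**/archive/**"
        aws_config:
          aws_region: "us-east-1"                              # placeholder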
  • gifted-barista-13026
    09/15/2022, 3:21 PM
    Hello there! We've been trying out DataHub for about a week, ingesting metadata from 3 different sources: Hive, Spark and Metabase, and we came across one issue:
    - we scanned the datasets from Hive (the data is stored in S3)
    - we have our pipelines interacting with those datasets in Spark (we connect Spark to the same Hive)
    - we explore the datasets from Metabase (we connect Metabase to Trino, which connects to the same Hive)
    The thing is that DataHub doesn't realize that the datasets are all the same, and repeats each dataset 3 times, once per ingestion. Is there a way to fix this?
    👍 2
  • green-lion-58215
    09/15/2022, 5:59 PM
    How can we change the environment fabric for our datasets? I see that all our ingested metadata is shown in the PROD fabric. How do we specify DEV/STAGING?
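    For context, most source configs accept an env setting that controls which fabric the ingested datasets land in; a hedged sketch with a placeholder source:
    source:
      type: mysql                          # placeholder source type
      config:
        host_port: "db.example.com:3306"   # placeholder
        env: DEV                           # fabric; allowed values come from DataHub's FabricType enum (PROD is the default)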
  • flat-painter-78331
    09/15/2022, 6:52 PM
    Hi guys, I'm looking to create a fresh DataHub instance on another machine with the same metadata as my local instance (e.g. tags, glossaries, descriptions) without having to manually re-enter it in the new instance. Is there any way I can pass these tags, glossaries, etc. through a recipe file, similar to how we add a source connector for ingestion?
  • agreeable-farmer-44067
    09/15/2022, 7:42 PM
    Hi all! I'm a newbie with DataHub. In my company we are evaluating governance tools and I understand that DataHub is a great option; we use Apache NiFi, Apache Hive, Apache Spark and HDFS. I installed it with Docker, but I don't have the NiFi connector. Can you help me with this integration?
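    For context, DataHub does ship a nifi source as an ingestion plugin; a hedged sketch of a recipe, where the URL and credentials are placeholders and the auth mode shown is an assumption that should be checked against the docs for your version:
    source:
      type: nifi
      config:
        site_url: "https://nifi.example.com:8443/nifi/"   # hypothetical NiFi URL
        auth: SINGLE_USER                                 # assumption: username/password auth mode
        username: "nifi_user"                             # placeholder
        password: "nifi_password"                         # placeholder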
  • brainy-table-99728
    09/15/2022, 1:52 PM
    Hey there, I'm very new to DataHub, so this might be an overly simple question - apologies in advance and thanks for any help! We are trying to ingest from Snowflake into DataHub and I'm not seeing a bunch of tables that I expect to see. We're using this very basic filter to just give us stuff in our PROD database. Am I missing something?
  • great-account-95406
    09/15/2022, 2:16 PM
    I've updated to 0.8.44 and now all my UI ingestions fail, but each one starts a new run labelled as "cli". Can you please explain this behaviour?
  • busy-dream-34673
    09/16/2022, 11:22 AM
    Hi, I'm new to DataHub and have started to discover the different data source connectors for the metadata ingestion part. I was wondering if an InfluxDB connector and a Grafana connector are planned soon? In our team we use this time-series database a lot with Grafana dashboards, and it would be very interesting to get both of them into DataHub. Thanks in advance!
    👍 1
  • proud-table-38689
    09/16/2022, 6:29 PM
    Re: Airflow, I'm trying to install the new plugin and get this 404 error in my jobs: https://my-datahub-url/aspects?action=ingestProposal does not exist. Is there anything I need to install within DataHub for this to work?
    ✅ 1
  • proud-table-38689
    09/16/2022, 7:37 PM
    When using the Airflow lineage server, do the Datasets need to pre-exist in DataHub? E.g. would this need to pre-exist in DataHub?
    ✅ 1