# ingestion
  • b

    bumpy-pharmacist-66525

    01/16/2023, 1:33 PM
Hi everyone, I am using the dbt ingestion source and am having issues with the `column_meta_mapping` feature. It seems I am not able to add either a `tag` or a `term` at the column level; however, the `meta_mapping` feature that adds tags/terms to the node itself works fine. Not sure if this is important, but for reference, I am using dbt and Iceberg together. I actually went into the source code of the dbt ingestion source to find out why it wasn't working. It seems that on line 1201 of the `dbt_common.py` file, the `columns` field (node.columns) in the DBTNode is always empty (https://github.com/datahub-project/datahub/blob/ce5545ed27eeb56669d0adccc0030fad7c[…]tadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py). It doesn't seem to be populated at any point, which I believe is why column-level meta mapping isn't working. Can anyone confirm whether the `column_meta_mapping` feature of the dbt ingestion source is working for them?
    ✅ 1
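For anyone comparing configs: a minimal sketch of the recipe shape being discussed, following the dbt source docs for meta mapping. The paths, target platform, and the meta keys `is_sensitive` / `data_domain` are placeholders, not from the thread:
Copy code
source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json   # placeholder paths to dbt artifacts
    catalog_path: ./target/catalog.json
    target_platform: iceberg                # placeholder
    # node-level mapping (reported to work)
    meta_mapping:
      owner:
        match: ".*"
        operation: "add_owner"
        config:
          owner_type: user
    # column-level mapping (reported as not applied)
    column_meta_mapping:
      is_sensitive:
        match: "true"
        operation: "add_tag"
        config:
          tag: "sensitive"
      data_domain:
        match: ".*"
        operation: "add_term"
        config:
          term: "PII"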
  • a

    alert-fall-82501

    01/16/2023, 4:16 PM
Hi Team - just need a quick suggestion: I am ingesting Hive metadata and it has a large volume of datasets. Can we limit the datasets ingested, or is there an option to ingest only selected datasets? TIA.
    ✅ 1
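A hedged sketch of the usual answer: the Hive source, like other SQLAlchemy-based sources, accepts allow/deny regex patterns, so ingestion can be limited to selected databases and tables. All names below are placeholders:
Copy code
source:
  type: hive
  config:
    host_port: hive-server:10000     # placeholder
    schema_pattern:
      allow:
        - "^analytics$"              # only this database
    table_pattern:
      allow:
        - "^analytics\\.daily_.*"    # only selected tables
      deny:
        - ".*_tmp$"                  # skip scratch tables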
  • l

    lemon-daybreak-58504

    01/16/2023, 6:35 PM
hi everyone, I am having trouble configuring a BigQuery ingestion source. The log says the problem is with "include_usage_statistics" - does anyone know how to fix it?
Copy code
~~ Execution Summary ~~
RUN_INGEST - exec_id: 53197807-e009-4413-9ba3-90a5a678a646
2023-01-16 174152.832770 INFO: Starting execution for task with name=RUN_INGEST
2023-01-16 174158.926519 INFO: stdout=venv setup time = 0
This version of datahub supports report-to functionality
datahub ingest run -c /tmp/datahub/ingest/53197807-e009-4413-9ba3-90a5a678a646/recipe.yml --report-to /tmp/datahub/ingest/53197807-e009-4413-9ba3-90a5a678a646/ingestion_report.json
[2023-01-16 174155,025] INFO {datahub.cli.ingest_cli:177} - DataHub CLI version: 0.8.43.5
[2023-01-16 174155,050] INFO {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-datahub-gms:8080
[2023-01-16 174158,036] ERROR {datahub.entrypoints:192} -
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 184, in __init__
    self.source: Source = source_class.create(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/sql/bigquery.py", line 989, in create
    config = BigQueryConfig.parse_obj(config_dict)
  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source_config/sql/bigquery.py", line 69, in __init__
    super().__init__(**data)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for BigQueryConfig
include_usage_statistics
  extra fields not permitted (type=value_error.extra)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 190, in run
    pipeline = Pipeline.create(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 301, in create
    return cls(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 189, in __init__
    self._record_initialization_failure(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 117, in _record_initialization_failure
    raise PipelineInitError(msg) from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (bigquery)
[2023-01-16 174158,036] ERROR {datahub.entrypoints:195} - Command failed:
	Failed to configure source (bigquery) due to
		'1 validation error for BigQueryConfig
		include_usage_statistics
		  extra fields not permitted (type=value_error.extra)'.
	Run with --debug to get full stacktrace.
	e.g. 'datahub --debug ingest run -c /tmp/datahub/ingest/53197807-e009-4413-9ba3-90a5a678a646/recipe.yml --report-to /tmp/datahub/ingest/53197807-e009-4413-9ba3-90a5a678a646/ingestion_report.json'
2023-01-16 174158.926748 INFO: Failed to execute 'datahub ingest'
2023-01-16 174158.926928 INFO: Caught exception EXECUTING task_id=53197807-e009-4413-9ba3-90a5a678a646, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task
    task_event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute
    raise TaskError("Failed to execute 'datahub ingest'")
acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
Execution finished with errors.
    ✅ 1
    👀 1
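Reading the traceback above: the executing CLI (0.8.43.5 in the log) appears to predate the `include_usage_statistics` option on the bigquery source, so pydantic rejects it as an extra field. A sketch of the two likely fixes, assuming nothing else in the recipe is at fault - either pin a newer CLI version for the UI executor, or drop the key the old CLI doesn't know:
Copy code
source:
  type: bigquery
  config:
    include_table_lineage: true
    # include_usage_statistics: false   # remove: not accepted by CLI 0.8.43.5
    # ... rest of the config unchanged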
  • g

    gentle-portugal-21014

    01/16/2023, 8:29 PM
Hi *, we've been trying to get the OpenAPI ingestion to work (albeit with the existing limitation of supporting only GET methods), but we keep getting several warnings about missing examples in the swagger file, and total_records_written is 0. Note that we stored the swagger file on a different WWW server, because the real API endpoints are not accessible from that server. Nevertheless, I'd assume that "warnings" shouldn't mean the method is ignored altogether, and that having examples in the swagger file shouldn't be critical as long as the method documentation uses properly defined structures. On the other hand, the discussion in https://datahubspace.slack.com/archives/CUMUWQU66/p1665773094663049 seems to suggest that at least my second assumption may not be correct - is it really the case that the current OpenAPI support doesn't use the structure definitions and relies solely on the examples? BTW, I get the same result if I download the swagger file for DataHub's own OpenAPI (from /openapi/v3/api-docs on the DataHub server) and process it the same way as our own swagger files, i.e. it's perfectly reproducible... Any comments, anybody?
    ✅ 1
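For reference, a minimal openapi recipe sketch; `forced_examples` is, as far as I can tell from the docs, the supported workaround for endpoints whose parameters lack examples. Every name and URL below is a placeholder:
Copy code
source:
  type: openapi
  config:
    name: my_api                     # placeholder
    url: https://api.example.com/    # placeholder
    swagger_file: swagger.json       # path of the spec, relative to url
    forced_examples:
      /entities/{id}:                # placeholder endpoint
        - 1                          # hand-supplied example value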
  • p

    polite-actor-701

    01/17/2023, 1:07 AM
Hello, everyone. I'm doing ingestion from a source I made. The source was newly created to select and ingest datasets from Oracle, built by referring to sql_common. However, after ingestion I found that some data was not indexed in ES. Looking at the attached image, there are 18,731 datasets in MySQL but only 18,729 dataset indexes in Kibana - 2 are missing. And if you look at the first and second documents in Kibana, you can see that information such as browsePaths and platform is missing, so the frontend shows only 18,727 datasets. It's not a problem tied consistently to specific data; sometimes everything is indexed well and sometimes entries are missing. May I know what the problem is? And is there any way to detect missing or incorrectly indexed data?
    ✅ 1
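On the detection question: one option is a reconciliation check after each run. A rough, hypothetical sketch assuming direct access to the MySQL backend and Elasticsearch, and the default `metadata_aspect_v2` table and `datasetindex_v2` index names:
Copy code
# Hypothetical reconciliation check: compare dataset counts between the
# DataHub MySQL backend and the Elasticsearch search index.
import json
import urllib.request

import pymysql  # assumption: pip install pymysql

# Count distinct dataset URNs in the aspect table (default table name).
conn = pymysql.connect(host="mysql", user="datahub", password="datahub", database="datahub")
with conn.cursor() as cur:
    cur.execute(
        "SELECT COUNT(DISTINCT urn) FROM metadata_aspect_v2 "
        "WHERE urn LIKE 'urn:li:dataset:%'"
    )
    (mysql_count,) = cur.fetchone()

# Count documents in the dataset search index (default index name).
with urllib.request.urlopen("http://elasticsearch:9200/datasetindex_v2/_count") as resp:
    es_count = json.load(resp)["count"]

print(f"mysql={mysql_count} es={es_count} missing={mysql_count - es_count}")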
  • b

    bland-appointment-45659

    01/17/2023, 5:04 AM
Hello team, I am trying to use stateful ingestion and noticed that objects ingested previously got deleted. Looking further, I could see that the ingestion had been updated to pull different sets of objects on past runs. Is it expected that filters should not be updated when stateful ingestion is enabled? If we update the filters and pull partial data, will all the older ingested metadata be wiped?
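That is consistent with how stale-entity removal behaves: with stateful ingestion enabled, entities seen in the previous run but missing from the current one are soft-deleted, so narrowing the filters removes the now-excluded metadata. A sketch of the switch that controls it, per the stateful ingestion docs:
Copy code
source:
  type: <your-source>
  config:
    stateful_ingestion:
      enabled: true
      # keep previously ingested entities even when the current
      # run's filters no longer match them
      remove_stale_metadata: false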
  • m

    miniature-branch-33689

    01/17/2023, 6:55 AM
Hey guys, I have a similar problem as [Anthony's thread]. I'm trying to ingest from SageMaker, but the ingestion is pending. I can start more manual ingestions and the execution counter goes up, but they are all pending. In the datahub-actions container logs I can see some errors.
    Copy code
    [2023-01-17 02:17:36,141] INFO     {datahub_actions.cli.actions:98} - Action Pipeline with name 'ingestion_executor' is now running.
    ...
      File "/usr/local/lib/python3.9/site-packages/avrogen/avrojson.py", line 358, in _record_from_json
        raise ValueError(f'{readers_schema.fullname} contains extra fields: {input_keys}')
    ValueError: com.linkedin.pegasus2avro.common.AuditStamp contains extra fields: {'message'}
    ✅ 1
  • a

    acceptable-morning-73148

    01/17/2023, 9:34 AM
Hello there, we have an ingestion process using SQLAlchemy as the source and are getting the following pair of errors: `unable to map type INTERVAL_MONTH(precision=4) to metadata schema` and `unable to map type INTERVAL_DAY(precision=4) to metadata schema`. Those are valid column types in our system, but it seems they are not recognized by DataHub. Any suggestions on how to handle this?
    ✅ 1
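If the pipeline code is under your control, one possible (unverified) workaround is registering the dialect's interval types with the SQL-common type map before the run. A sketch assuming `register_custom_type` in `datahub.ingestion.source.sql.sql_common` and a stand-in SQLAlchemy type:
Copy code
# Hypothetical workaround: map an unrecognized dialect type to a known
# metadata type so the schema converter stops emitting the warning.
import sqlalchemy.types as types

from datahub.ingestion.source.sql.sql_common import register_custom_type
from datahub.metadata.schema_classes import StringTypeClass


class IntervalMonth(types.UserDefinedType):
    """Stand-in for the dialect's INTERVAL_MONTH type (placeholder)."""

    def get_col_spec(self, **kw):
        return "INTERVAL_MONTH"


# Register before the pipeline runs; unmapped types otherwise fall back
# to NullType and produce the "unable to map type" warning.
register_custom_type(IntervalMonth, StringTypeClass)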
  • t

    thousands-yacht-8284

    01/17/2023, 10:20 AM
Hi all, I have a question about glossary term ingestion. We have some terms that are synonyms, and I can't find a "synonyms" relation. I wonder if it exists, or if I should use custom properties to create a synonyms list. Has anyone already had this need, and how did you solve it?
    👀 2
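As far as I know there is no first-class "synonym" relation in the glossary model (terms support `inherits` and `contains`). A sketch of the custom-properties fallback in the business glossary YAML format, where the `synonyms` property name is made up:
Copy code
version: 1
source: DataHub
owners:
  users:
    - datahub
nodes:
  - name: Classification
    description: Example node (placeholder)
    terms:
      - name: CustomerID
        description: Primary identifier for a customer (placeholder)
        custom_properties:
          synonyms: "client_id, customer_number"   # made-up property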
  • c

    curved-planet-99787

    01/17/2023, 12:58 PM
Is there a way to specify a subset of fields/columns to be screened during SQL profiling? I know there is the `max_number_of_fields_to_profile` parameter, which caps how many fields are profiled, but I want to exclude or include specific fields by name.
👍 2
    ✅ 1
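A hedged sketch: in the SQL sources, `profile_pattern` allow/deny entries are, as far as I can tell from the docs, also matched against `schema.table.column`, which would let you include or exclude specific fields by name. Everything below is a placeholder:
Copy code
source:
  type: postgres          # placeholder; any SQL source with profiling
  config:
    profiling:
      enabled: true
    profile_pattern:
      allow:
        - "public\\.orders\\..*"   # profile only this table's columns
      deny:
        - ".*\\.raw_payload$"      # never profile this column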
  • l

    limited-forest-73733

    01/17/2023, 2:27 PM
Hey team! I deployed the latest DataHub version, i.e. 0.9.6, using the latest CLI version, but I am not able to see column-level lineage. Do I need to enable any field? Thanks in advance
    ✅ 1
  • c

    creamy-machine-95935

    01/17/2023, 6:46 PM
Hey Team! I am trying to develop a LookML recipe. Could you please tell me whether it is compatible with GitLab or only with GitHub? 🚀
    👀 1
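The LookML source parses a local checkout, so it should be VCS-agnostic; the GitHub-specific options are, as far as I know, only used for rendering source links. A sketch assuming you clone the GitLab repo yourself (e.g. in CI) before running the recipe:
Copy code
# assumes: git clone <your GitLab repo> /tmp/lookml-repo
source:
  type: lookml
  config:
    base_folder: /tmp/lookml-repo            # local clone of the GitLab project
    api:
      base_url: https://company.looker.com   # placeholder
      client_id: ${LOOKER_CLIENT_ID}
      client_secret: ${LOOKER_CLIENT_SECRET}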
  • l

    late-gpu-33114

    01/17/2023, 7:12 PM
    Hi, is there a way to ingest a graphql api similar to REST via OpenAPI?
    ✅ 1
  • l

    lively-dusk-19162

    01/17/2023, 7:15 PM
    Hey Team, I am trying to create a new entity in Datahub. What are the steps involved in creating a new entity?
    ✅ 1
    👀 1
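At a high level (per the metadata-model docs), adding an entity means defining a key aspect and any new aspects in PDL, registering them in the entity registry, and rebuilding. A sketch of the registry entry, with a hypothetical entity name:
Copy code
# metadata-models/src/main/resources/entity-registry.yml (excerpt)
entities:
  - name: mlFeatureStore            # hypothetical new entity
    keyAspect: mlFeatureStoreKey    # key aspect you define in PDL
    aspects:
      - ownership                   # reuse existing aspects...
      - status
      - mlFeatureStoreProperties    # ...plus your own properties aspect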
  • b

    brash-helicopter-28341

    01/18/2023, 7:36 AM
    Hi Team!
    ✅ 1
  • b

    brash-helicopter-28341

    01/18/2023, 7:38 AM
I'm experiencing a bug where an ingestion runs twice at once: https://github.com/datahub-project/datahub/issues/7053 - is there a solution for it?
    ✅ 1
  • e

    elegant-salesmen-99143

    01/18/2023, 11:27 AM
    is there any way to get notifications when scheduled ingestions fail?
    ✅ 1
  • f

    flat-engineer-75197

    01/18/2023, 12:21 PM
    👋 Hey all I have a Glue catalog where some of the databases are actually resource links, shared from other AWS accounts. The Glue recipe works fine except for when it hits these “non-native” databases. I thought to filter them based on the original owner but as far as I can tell, none of the recipe options allow this. Has anybody run into this before?
    👀 1
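I'm not aware of an owner-account filter either; the closest knob seems to be filtering the shared databases out by name, assuming the resource links follow a naming convention (the pattern below is a placeholder):
Copy code
source:
  type: glue
  config:
    aws_region: eu-west-1       # placeholder
    database_pattern:
      deny:
        - "^shared_.*"          # placeholder for the resource-link databases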
  • i

    important-helmet-98156

    01/18/2023, 12:58 PM
Hello Team, we are currently evaluating DataHub as our data catalog and are facing some questions regarding licensing, as we only have a runtime license for our SAP HANA database but want to ingest metadata from our SAP software products like SAP ERP, SAP BW, and SAP C4C. For the connection to SAP HANA as a metadata source, the connector from the docs uses hdbcli (https://datahubproject.io/docs/generated/ingestion/sources/hana), and on the Python Package Index page for hdbcli, under license, I found the following:
    By using this software, you agree that the following text is incorporated into the terms of the Developer Agreement:
    If you are an existing SAP customer for On-Premise software, your use of this current software is also covered by the terms of your software license agreement with SAP, including the Use Rights, the current version of which can be found at: <https://www.sap.com/about/agreements/product-use-and-support-terms.html?tag=agreements:product-use-support-terms/on-premise-software/software-use-rights>
For me, this would mean that I could use the connector and get the metadata into DataHub with our existing SAP HANA on-premise database. However, I am not sure; therefore, I am asking whether someone has had a similar issue. Thank you in advance and best regards, Martin #sap #sap-hana
    👀 1
    ✅ 1
  • l

    lemon-daybreak-58504

    01/18/2023, 1:27 PM
Hi everyone, I'm having this error when ingesting data from BigQuery (the same Execution Summary as in my message above; the core of it is):
Copy code
pydantic.error_wrappers.ValidationError: 1 validation error for BigQueryConfig
include_usage_statistics
  extra fields not permitted (type=value_error.extra)
...
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure source (bigquery)
Execution finished with errors.
My datahub version: acryl-datahub, version 0.9.6. Recipe:
Copy code
source:
  type: bigquery
  config:
    include_table_lineage: true
    include_usage_statistics: false
    include_tables: true
    include_views: true
    profiling:
      enabled: true
      profile_table_level_only: true
    stateful_ingestion:
      enabled: true
    credential:
      project_id: innovation-datalake-test
      private_key_id: <private_key_id>
      private_key: '${Bigquery_innovation_test}'
      client_email: <client_email>
      client_id: <client_id>
The connection works but it can't ingest.
    👀 1
    ✅ 1
  • m

    magnificent-lawyer-97772

    01/18/2023, 4:38 PM
Hey folks, I am trying to ingest some metadata from Databricks using the Hive connector. There seems to be a problem in that the platform attribute isn't supported: all my Databricks resources end up with hive in their URNs. Any advice?
  • a

    alert-fall-82501

    01/18/2023, 5:54 PM
Hi team - I am setting up the connection between Airflow and DataHub through the Airflow UI. Can anybody tell me what needs to go in the server endpoint, and which connection type to use? Thanks in advance!
    ✅ 1
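For reference, the Airflow lineage docs register the connection with type `datahub_rest` and the GMS address as the host; the CLI equivalent of the UI form looks like this (the host is a placeholder):
Copy code
airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' \
    --conn-host 'http://datahub-gms:8080'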
  • a

    alert-fall-82501

    01/18/2023, 5:54 PM
    image.png
    👀 1
  • b

    bright-receptionist-94235

    01/19/2023, 4:43 AM
We have installed the latest version (0.9.6) and created a MySQL ingestion, but its status is stuck on “Pending” - ideas?
    👀 1
    ✅ 2
  • o

    orange-actor-62586

    01/19/2023, 6:57 AM
Hello All, I'm Alex, a Data Governance Manager. Nice to meet you ^^ Our company has been using DataHub since last year, and this year we plan to make it easy for users to find all data-related materials in DataHub. For users to find data easily and quickly, it is extremely important for us to design the site's information architecture around DataHub. Therefore, I am curious whether the 2023 roadmap includes updates for quickly searching DataHub information +_+
    ✅ 1
  • g

    great-computer-16446

    01/19/2023, 7:54 AM
Hi team, https://github.com/datahub-project/datahub/issues/7080 - we have encountered some problems related to the performance of the MAE consumer. We have tried several methods with no obvious effect, and at present the lag keeps increasing. Are there any suggested solutions? Thanks
    ✅ 1
  • s

    salmon-helmet-338

    01/19/2023, 8:53 AM
Hi team, I have been having issues with S3 ingestion from the UI (whereas I am able to run it successfully via the console). I select the "Ingestion" tab in the UI and then the "Other" option, providing the following recipe, which works as an S3 import from the console:
    Copy code
    source:
      type: s3
      config:
        path_specs:
          - include: <mys3path>
        aws_config:
          aws_region: <myregion>
          aws_access_key_id: <mykey>
          aws_secret_access_key: <mysecret>
        env: "test"
        profiling:
          enabled: false
and I get the following error when I run it from the UI:
    Copy code
    'Collecting pyspark==3.0.3\n'
               '/usr/local/bin/ingestion_common.sh: line 3:    44 Killed                  pip install -r $req_file\n',
               "2023-01-19 08:20:42.689903 [exec_id=4ba00e6f-ecac-405a-a5cd-281dd4f1cf94] INFO: Failed to execute 'datahub ingest'",
               '2023-01-19 08:20:42.692510 [exec_id=4ba00e6f-ecac-405a-a5cd-281dd4f1cf94] INFO: Caught exception EXECUTING '
               'task_id=4ba00e6f-ecac-405a-a5cd-281dd4f1cf94, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 168, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
Would you know what the reason and solution might be? Thanks a lot
    👀 1
    ✅ 1
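"Killed" during `pip install` usually means the container ran out of memory while building the plugin venv (pyspark is large). A sketch, under the assumption of a docker-compose quickstart-style deployment, of giving the actions container more headroom:
Copy code
# docker-compose override (assumption: quickstart-style deployment;
# the exact memory key depends on your compose version)
services:
  datahub-actions:
    mem_limit: "2g"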
  • s

    salmon-angle-92685

    01/19/2023, 11:07 AM
Hello guys, would someone have a Python example of how to attach a Glossary Term to a table via the GraphQL API? The docs aren't clear to me. Thank you so much!
    ✅ 1
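A minimal sketch using the `addTerm` mutation against `/api/graphql`; the URNs, endpoint, and token below are placeholders:
Copy code
# Attach a glossary term to a dataset via DataHub's GraphQL endpoint.
import requests

DATAHUB_GRAPHQL = "http://localhost:8080/api/graphql"  # placeholder GMS URL
TOKEN = "<personal-access-token>"                      # placeholder

mutation = """
mutation addTerm($input: TermAssociationInput!) {
  addTerm(input: $input)
}
"""
variables = {
    "input": {
        "termUrn": "urn:li:glossaryTerm:Classification.Sensitive",  # placeholder
        "resourceUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
    }
}

resp = requests.post(
    DATAHUB_GRAPHQL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"query": mutation, "variables": variables},
)
resp.raise_for_status()
print(resp.json())  # {"data": {"addTerm": true}} on success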
  • e

    elegant-salesmen-99143

    01/19/2023, 1:42 PM
Hi. I don't see an `include_views` option in the config details for Hive. Is it not possible to display Hive views in DataHub? 🤔
    👀 1
  • b

    brief-oyster-50637

    01/19/2023, 5:26 PM
“Domains”: how do I map to them? Hi all. Is it possible to add a dataset to a DataHub Domain based on source metadata? To be more specific, my source is dbt and I’ve been trying to do this with dbt meta properties. The documentation only mentions add_tag, add_term, add_terms, and add_owner - nothing like “add_domain”. Isn’t it possible? Am I missing something? Are there other ways to map a source’s metadata to a Domain? Thank you! https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings
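I don't believe the dbt meta mapping has an add_domain operation; the usual fallback is a transformer in the recipe, e.g. `pattern_add_dataset_domain` keyed on the dataset URN (the rules and domain URNs below are placeholders):
Copy code
transformers:
  - type: pattern_add_dataset_domain
    config:
      domain_pattern:
        rules:
          ".*\\.finance\\..*": ["urn:li:domain:finance"]        # placeholder
          ".*\\.marketing\\..*": ["urn:li:domain:marketing"]    # placeholder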