# ingestion
  • f

    future-florist-65080

    02/21/2023, 12:53 AM
    Hi All, is it possible to use the transformer
    pattern_add_dataset_schema_terms
    to add glossary terms to fields within a specific database schema? I have tried using the regex
    .*<schema>.*<field_name>.*
    , similar to the example for Pattern Add Dataset Domain. However this is not applying any glossary terms. It seems like the regex is applying only to the field name, not to the full URN?
    plus1 1
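    A minimal sketch for the question above, assuming the documented shape of the pattern_add_dataset_schema_terms transformer. As observed, the rules regexes appear to be evaluated against the schema field path rather than the full dataset URN, so scoping to one database schema may need a separate dataset-level transformer; the field names and glossary term URNs below are hypothetical placeholders.
    Copy code
    transformers:
        - type: "pattern_add_dataset_schema_terms"
          config:
              term_pattern:
                  rules:
                      # matched against the field path, e.g. "customer_email", not the dataset URN
                      '.*customer_email.*': ["urn:li:glossaryTerm:Classification.PII"]
                      '.*_id$': ["urn:li:glossaryTerm:Identifier"]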
  • r

    red-action-68363

    02/21/2023, 9:58 AM
    Ingestion Delta Lake error?
    d
    • 2
    • 3
  • n

    numerous-account-62719

    02/21/2023, 12:59 PM
    Hi Team, I'm getting the following error. Can someone help me?
    Copy code
    (urn:li:dataPlatform:postgres,inventory_data.public._v_router_interface_master,PROD)\n'
               '[2023-02-21 12:20:33,827] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'inventory_data.public._v_router_interface_master\n'
               '[2023-02-21 12:20:33,949] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_router_interface_master-subtypes\n'
               '[2023-02-21 12:20:34,027] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_router_interface_master-viewProperties\n'
               '[2023-02-21 12:20:34,119] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_sitedetails,PROD)\n'
               '[2023-02-21 12:20:34,237] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_sitedetails\n'
               '[2023-02-21 12:20:34,366] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitedetails-subtypes\n'
               '[2023-02-21 12:20:34,384] INFO     {datahub.ingestion.source.ge_data_profiler:930} - Finished profiling '
               'inventory_data.public.correlation_groups_master_new; took 66.890 seconds\n'
               '[2023-02-21 12:20:35,687] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitedetails-viewProperties\n'
               '[2023-02-21 12:20:35,742] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_sitelist1,PROD)\n'
               '[2023-02-21 12:20:35,824] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_sitelist1\n'
               '[2023-02-21 12:20:35,898] INFO     {datahub.ingestion.source.ge_data_profiler:930} - Finished profiling '
               'inventory_data.public.bbip_route_details; took 63.394 seconds\n'
               '[2023-02-21 12:20:35,967] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitelist1-subtypes\n'
               '[2023-02-21 12:20:36,087] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitelist1-viewProperties\n'
               '[2023-02-21 12:20:36,154] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_edgelist2,PROD)\n'
               '[2023-02-21 12:20:36,250] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_edgelist2\n'
               '[2023-02-21 12:20:36,377] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_edgelist2-subtypes\n'
               '[2023-02-21 12:20:36,516] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_edgelist2-viewProperties\n'
               '[2023-02-21 12:20:36,618] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_sitetopology2,PROD)\n'
               '[2023-02-21 12:20:36,800] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_sitetopology2\n'
               '[2023-02-21 12:20:36,855] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitetopology2-subtypes\n'
               '[2023-02-21 12:20:36,933] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitetopology2-viewProperties\n'
               '[2023-02-21 12:20:37,004] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_topologygraphsites,PROD)\n'
               '[2023-02-21 12:20:37,101] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'inventory_data.public._v_topologygraphsites\n'
               '[2023-02-21 12:20:37,226] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_topologygraphsites-subtypes\n'
               '[2023-02-21 12:20:37,342] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_topologygraphsites-viewProperties\n'
               '[2023-02-21 12:20:37,487] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_nodetopology2,PROD)\n'
               '[2023-02-21 12:20:37,633] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_nodetopology2\n'
               '/usr/local/bin/run_ingest.sh: line 26:   695 Killed                  ( python3 -m datahub ingest -c "$4/$1.yml" )\n',
               "2023-02-21 12:20:39.139032 [exec_id=26e8f736-796a-430b-8987-81242e1d53b2] INFO: Failed to execute 'datahub ingest'",
               '2023-02-21 12:20:39.140298 [exec_id=26e8f736-796a-430b-8987-81242e1d53b2] INFO: Caught exception EXECUTING '
               'task_id=26e8f736-796a-430b-8987-81242e1d53b2, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    b
    d
    h
    • 4
    • 13
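    The log above ends with the ingest subprocess being Killed at line 26 of run_ingest.sh, which usually means the OS killed the process, commonly from memory pressure while profiling large tables, rather than a source error. A hedged sketch of one way to keep memory bounded, assuming the standard profiling / profile_pattern options of the SQL sources; the deny pattern is a hypothetical example.
    Copy code
    source:
        type: postgres
        config:
            # ...existing connection settings...
            profiling:
                enabled: false   # or leave enabled and narrow the scope below
            profile_pattern:
                deny:
                    - 'inventory_data\.public\.correlation_groups_.*'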
  • g

    green-lock-62163

    02/21/2023, 5:41 PM
    Hi All, I am struggling with lineage_job_dataflow_new_api_simple.py. All previously created/added lineage is lost when the script gets re-executed. Scenario: launch the script, pick one job in the pipeline, and manually add an additional downstream lineage to another job. Relaunch the script and observe that the manual lineage has vanished.
    ✅ 1
    b
    • 2
    • 18
  • r

    red-waitress-53338

    02/21/2023, 6:09 PM
    Hi, I am trying to ingest metadata from the UI, and I am getting the following error when trying to set up the Actions framework.
    ✅ 1
    r
    o
    • 3
    • 6
  • a

    acceptable-nest-20465

    02/21/2023, 6:17 PM
    Hello All, is it possible to ingest data from Neo4j?
    ✅ 1
    o
    • 2
    • 1
  • w

    white-horse-97256

    02/21/2023, 6:28 PM
    Hi Team, how can we specify the datahub-gms server URL for the Java-based emitter?
    ✅ 1
    r
    a
    o
    • 4
    • 7
  • f

    few-library-66655

    02/21/2023, 6:54 PM
    Hello, team! I am trying to ingest AWS Glue, and I found that the
    aws_access_key_id
    ,
    aws_secret_access_key
    and
    aws_session_token
    will expire.
    Copy code
    ClientError: An error occurred (ExpiredTokenException) when calling the GetDatabases operation: The security token included in the request is expired
    So I have to generate a new token manually every time I run the ingestion. I would like to
    Configure an Ingestion Schedule
    to sync Glue every day; what should I do to automatically pick up the latest AWS tokens, or how can I keep them from expiring?
    c
    • 2
    • 3
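    A hedged sketch for the scheduled-Glue question above: rather than pasting short-lived STS keys into the recipe, the Glue source can be pointed at an IAM role via aws_role (the same field used in the prod recipe a few messages below), so fresh temporary credentials are obtained on every scheduled run. The account id and role name are placeholders, and the host running ingestion must be allowed to assume that role.
    Copy code
    source:
        type: glue
        config:
            aws_region: us-west-2
            aws_role: 'arn:aws:iam::123456789012:role/datahub-glue-ingestion'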
  • w

    white-horse-97256

    02/21/2023, 9:50 PM
    Hi Team, when we create a recipe from the CLI, will DataHub store the source credentials in its database?
    b
    • 2
    • 11
  • f

    few-library-66655

    02/22/2023, 12:13 AM
    I am running Datahub locally and trying to ingest metadata from Glue. Here is my recipe:
    Copy code
    source:
        type: glue
        config:
            aws_region: us-west-2
            aws_role: 'arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME'
    When I run this in prod it succeeds, and I want to test it locally. But it fails locally with the error message
    Copy code
    PipelineInitError: Failed to configure the source (glue): 'NoneType' object has no attribute 'access_key'
    can anyone help me with that?
    a
    • 2
    • 3
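    A hedged sketch for the local failure above: the 'NoneType' object has no attribute 'access_key' error suggests boto3 found no base credentials on the local machine to assume the role with. One way to test locally is to supply the aws_access_key_id / aws_secret_access_key / aws_session_token fields mentioned earlier in the thread (recipes can expand environment variables), while keeping aws_role as in prod; all values below are placeholders.
    Copy code
    source:
        type: glue
        config:
            aws_region: us-west-2
            aws_role: 'arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME'
            aws_access_key_id: '${AWS_ACCESS_KEY_ID}'
            aws_secret_access_key: '${AWS_SECRET_ACCESS_KEY}'
            aws_session_token: '${AWS_SESSION_TOKEN}'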
  • b

    bright-wall-5515

    02/22/2023, 9:18 AM
    Hey everyone 👋, I’m new to DataHub, and we are working with an instance set up by our previous team. Some of the tasks in our DataHub pipeline started failing over the weekend with this error:
    Copy code
    [2023-02-22 00:26:03,368] {{pod_launcher.py:156}} INFO - b' File "pydantic/validators.py", line 715, in find_validators\n'
    [2023-02-22 00:26:03,368] {{pod_launcher.py:156}} INFO - b"RuntimeError: no validator found for <class 're.Pattern'>, see `arbitrary_types_allowed` in Config\n"
    [2023-02-22 00:26:04,425] {{pod_launcher.py:171}} INFO - Event: workspaces-pipeline-5357b4ab2cdd4abf933f2d307a107cf2 had an event of type Failed
    Can someone help me find the source of this problem, or point me to where to look for the error, please? 🙏
    h
    • 2
    • 15
  • a

    ambitious-umbrella-5901

    02/22/2023, 1:29 PM
    Looking to ingest from Mongo Atlas. We use X509 certificates for authentication. I don’t see anything in the documentation for the mongodb module that would allow for this. Any suggestions on how to use certificate based authentication?
    a
    • 2
    • 1
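    A hedged sketch for the Mongo Atlas question above, assuming the mongodb source passes its options block through to pymongo's MongoClient, in which case the standard client options for certificate authentication (tls, tlsCertificateKeyFile, authMechanism: MONGODB-X509) would apply; the URI and paths are placeholders.
    Copy code
    source:
        type: mongodb
        config:
            connect_uri: 'mongodb+srv://cluster0.example.mongodb.net'
            options:
                tls: true
                tlsCertificateKeyFile: '/path/to/client-cert-and-key.pem'
                authMechanism: 'MONGODB-X509'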
  • b

    bitter-evening-61050

    02/22/2023, 3:03 PM
    Hi Team, I am trying to integrate the Microsoft Teams action with DataHub. I get this error when I try to install the Teams action with pip install datahub-actions[teams]: ERROR: Could not find a version that satisfies the requirement datahub-actions[teams] (from versions: none) ERROR: No matching distribution found for datahub-actions[teams]. DataHub version = 0.9.5
    a
    o
    • 3
    • 2
  • g

    great-flag-53653

    02/22/2023, 5:14 PM
    Hi! I'm testing DataHub locally. I started yesterday by ingesting metadata from our BigQuery source, which generated table lineage correctly. Today I added our dbt source to the recipe and ran the ingestion again. For some tables I now see the dbt icon as well as the BigQuery icon in the graphical lineage, but for others (in this case a view on top of a table) I see duplicated datasets/nodes in the lineage. Do I need to remove everything from DataHub and run my ingestion again now that I have both dbt and BigQuery in my recipe? Also, the correct case seems to show on the view nodes (the two on the right) but not on the tables: all of these tables and views are "PascalCase", which shows correctly on the views, and for the others I only see the right casing if I click a node in the lineage graph and look at the name in the window that pops up. Example:
    a
    h
    • 3
    • 13
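    A hedged sketch for the duplicate-node question above: dbt and BigQuery entities only collapse into one node when both sources emit identical dataset URNs, so the dbt source needs target_platform: bigquery and the env and name casing of the two sources have to line up (warehouse URNs are often lower-cased, which would explain the PascalCase views showing up as separate nodes). Exact casing options vary by CLI version; paths are placeholders.
    Copy code
    source:
        type: dbt
        config:
            manifest_path: './target/manifest.json'
            catalog_path: './target/catalog.json'
            target_platform: bigquery   # must match the warehouse source's platform
            env: PROD                   # and its environment, so the URNs line up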
  • a

    acoustic-quill-54426

    02/22/2023, 6:34 PM
    Hi! We are using the advanced use case of transformers to add multiple aspects to some datasets depending on business logic. After upgrading the bigquery source connector, we seem to receive
    MetadataChangeProposal
    instead of
    MetadataChangeEvent
    in our
    transform
    . Before, we could add aspects to the `proposedSnapshot.aspects` list. Do you have any tips on what we can do now? Maybe yield a new RecordEnvelope with an MCP?
    ✅ 1
    m
    • 2
    • 6
  • w

    white-horse-97256

    02/22/2023, 8:14 PM
    Hi Team, can we create a DataFlow from the CLI?
    h
    • 2
    • 3
  • w

    white-horse-97256

    02/22/2023, 10:54 PM
    Hi Team, another question: what is the difference between
    MetadataChangeProposalWrapper
    and
    UpsertAspectRequest?
    h
    • 2
    • 1
  • s

    silly-intern-25190

    02/23/2023, 2:19 AM
    Hi team, hope everyone is doing well. I am facing a weird issue while ingesting from a Vertica database and don't know if it comes from DataHub or SQLAlchemy: whenever ingestion starts, DataHub creates 5 connections to the database but uses only 1 connection to execute the query. Also, if max_threads is set to, say, 5, it creates 25 connections to the database but still uses 1 connection to execute the query. It would be helpful if someone could point me towards documentation on how connections to the database are managed.
    h
    b
    • 3
    • 9
  • r

    rich-daybreak-77194

    02/23/2023, 5:17 AM
    Hi everyone, I’m integrating Great Expectations with DataHub. When I run the checkpoint file I get the error “great_expectations.exceptions.exceptions.InvalidDataContextConfigError: Error while processing DataContextConfig: datasources”. How can I solve it?
    great_expectations.yml, new_checkpoint
    h
    • 2
    • 6
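    A hedged note on the Great Expectations error above: InvalidDataContextConfigError ... datasources points at the datasources block of great_expectations.yml itself rather than at the DataHub integration, so that file is the first thing to fix. For reference, a sketch of how the DataHub action is typically added to a checkpoint's action_list once the context loads, assuming the documented DataHubValidationAction; the GMS URL is a placeholder.
    Copy code
    action_list:
        - name: datahub_action
          action:
              module_name: datahub.integrations.great_expectations.action
              class_name: DataHubValidationAction
              server_url: 'http://datahub-gms-host:8080'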
  • b

    best-notebook-58252

    02/23/2023, 7:36 AM
    Hi all. I’m trying to run a
    lookml
    ingestion with the CLI, but it’s stuck because my SSH key requires a passphrase. Is it possible to pass it in? I was trying something like
    echo $PASSWORD | datahub ingest -c lookml.yaml
    but it doesn’t work
    h
    • 2
    • 1
  • q

    quiet-jelly-11365

    02/23/2023, 9:17 AM
    Hi team, when I try to check plugins on my machine (Mac M1 chip), I get the error below. Is anyone else facing the same issue?
    Copy code
    datahub check plugins
    Sources:
    [2023-02-22 17:52:56,159] ERROR    {datahub.entrypoints:225} - Command failed: code() argument 13 must be str, not int
    o
    • 2
    • 2
  • l

    lemon-scooter-69730

    02/23/2023, 10:41 AM
    Is it possible to ingest dbt manifest/catalogue data from a Google Cloud Storage bucket?
    a
    • 2
    • 2
  • r

    red-easter-85320

    02/23/2023, 10:56 AM
    What's the account/password for MySQL in the DataHub Docker deployment? #ingestion
    ✅ 1
    b
    • 2
    • 2
  • c

    creamy-portugal-88620

    02/23/2023, 12:33 PM
    Hi Team, I ingested metadata from Athena but am getting the error below.
    d
    • 2
    • 2
  • c

    creamy-portugal-88620

    02/23/2023, 12:33 PM
    Copy code
    2023-02-23 12:22:16,182 INFO sqlalchemy.engine.Engine [raw sql] {}
    [2023-02-23 12:22:16,182] INFO     {sqlalchemy.engine.Engine:1858} - [raw sql] {}
    [2023-02-23 12:22:17,543] WARNING  {datahub.ingestion.source.sql.sql_common:643} - Unable to ingest sampledb.temp_batch_dlq due to an exception.
     Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-athena-0.10.0/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 639, in loop_tables
        yield from self._process_table(
      File "/tmp/datahub/ingest/venv-athena-0.10.0/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 738, in _process_table
        yield from self.add_table_to_schema_container(
      File "/tmp/datahub/ingest/venv-athena-0.10.0/lib/python3.10/site-packages/datahub/ingestion/source/sql/athena.py", line 238, in add_table_to_schema_container
        parent_container_key=self.get_database_container_key(db_name, schema),
      File "/tmp/datahub/ingest/venv-athena-0.10.0/lib/python3.10/site-packages/datahub/ingestion/source/sql/athena.py", line 220, in get_database_container_key
        assert db_name == schema
    AssertionError
  • b

    bitter-furniture-95993

    02/23/2023, 2:08 PM
    Hello, I am trying to configure Airflow to send lineage to DataHub using Kafka. Airflow and DataHub are on different servers. I am able to send data successfully using the REST connection, but the Kafka connection fails. It seems to be looking for (host='localhost', port=8081), so I guess a Kafka schema registry parameter is missing, but I can't find how to add it. I verified that the network ports are open on both sides. Here is the error I get:
    [2023-02-23, 13:36:35 UTC] {logging_mixin.py:137} INFO - Exception: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn (self._dns_host, self.port), self.timeout, **extra_kw File "/home/debian/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection raise err File "/home/debian/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection sock.connect(sa) ConnectionRefusedError: [Errno 111] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen chunked=chunked, File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request conn.request(method, url, **httplib_request_kw) File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request super(HTTPConnection, self).request(method, url, body=body, headers=headers) File "/usr/lib/python3.7/http/client.py", line 1260, in request self._send_request(method, url, body, headers, encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1306, in _send_request self.endheaders(body, encode_chunked=encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1255, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1030, in _send_output self.send(msg) File "/usr/lib/python3.7/http/client.py", line 970, in send self.connect() File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect conn = self._new_conn() File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn self, "Failed to establish a new connection: %s" % e urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fce2cc7e828>: Failed to establish a new connection: [Errno 111] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/requests/adapters.py", line 499, in send timeout=timeout, File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2] File "/home/debian/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /subjects/MetadataChangeProposal_v1-value/versions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce2cc7e828>: Failed to establish a new connection: [Errno 111] Connection refused')) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/serializing_producer.py", line 172, in produce value = self._value_serializer(value, ctx) File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/schema_registry/avro.py", line 251, in __call__ self._schema) File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/schema_registry/schema_registry_client.py", line 338, in register_schema body=request) File 
"/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/schema_registry/schema_registry_client.py", line 127, in post return self.send_request(url, method='POST', body=body) File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/schema_registry/schema_registry_client.py", line 169, in send_request headers=headers, data=body, params=query) File "/home/debian/.local/lib/python3.7/site-packages/requests/sessions.py", line 587, in request resp = self.send(prep, **send_kwargs) File "/home/debian/.local/lib/python3.7/site-packages/requests/sessions.py", line 701, in send r = adapter.send(request, **kwargs) File "/home/debian/.local/lib/python3.7/site-packages/requests/adapters.py", line 565, in send raise ConnectionError(e, request=request) requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /subjects/MetadataChangeProposal_v1-value/versions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce2cc7e828>: Failed to establish a new connection: [Errno 111] Connection refused')) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/datahub_provider/_plugin.py", line 281, in custom_on_success_callback datahub_task_status_callback(context, status=InstanceRunResult.SUCCESS) File "/home/debian/.local/lib/python3.7/site-packages/datahub_provider/_plugin.py", line 145, in datahub_task_status_callback dataflow.emit(emitter, callback=_make_emit_callback(task.log)) File "/home/debian/.local/lib/python3.7/site-packages/datahub/api/entities/datajob/dataflow.py", line 140, in emit emitter.emit(mcp, callback) File "/home/debian/.local/lib/python3.7/site-packages/datahub/emitter/kafka_emitter.py", line 119, in emit return self.emit_mcp_async(item, callback or _error_reporting_callback) File "/home/debian/.local/lib/python3.7/site-packages/datahub/emitter/kafka_emitter.py", line 150, in emit_mcp_async on_delivery=callback, File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/serializing_producer.py", line 174, in produce raise ValueSerializationError(se) confluent_kafka.error.ValueSerializationError: KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /subjects/MetadataChangeProposal_v1-value/versions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce2cc7e828>: Failed to establish a new connection: [Errno 111] Connection refused'))"} [2023-02-23, 13:36:35 UTC] {local_task_job.py:159} INFO - Task exited with return code 0 [2023-02-23, 13:36:35 UTC] {taskinstance.py:2582} INFO - 0 downstream tasks scheduled from follow-on schedule check
  • s

    straight-laptop-6275

    02/23/2023, 2:42 PM
    Hi, I want to look into the MySQL database where the metadata events are stored.
    ✅ 1
  • s

    straight-laptop-6275

    02/23/2023, 2:42 PM
    Is someone aware of the credentials?
  • s

    straight-laptop-6275

    02/23/2023, 2:52 PM
    I mean the credentials of prerequisite-mysql
    b
    • 2
    • 1
  • l

    lemon-scooter-69730

    02/23/2023, 5:03 PM
    How do you clear already ingested datasets?
    a
    b
    • 3
    • 6