# ingestion
  • f

    future-florist-65080

    02/21/2023, 12:53 AM
    Hi All, is it possible to use the transformer
    pattern_add_dataset_schema_terms
    to add glossary terms to fields within a specific database schema? I have tried using the regex
    .*<schema>.*<field_name>.*
    , similar to the example for Pattern Add Dataset Domain. However this is not applying any glossary terms. It seems like the regex is applying only to the field name, not to the full URN?
    plus1 1
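    A minimal sketch for the question above, assuming the documented shape of the pattern_add_dataset_schema_terms transformer. As observed, the rules regexes appear to be evaluated against the schema field path rather than the full dataset URN, so scoping to one database schema may need a separate dataset-level transformer; the field names and glossary term URNs below are hypothetical placeholders.
    Copy code
    transformers:
        - type: "pattern_add_dataset_schema_terms"
          config:
              term_pattern:
                  rules:
                      # matched against the field path, e.g. "customer_email", not the dataset URN
                      '.*customer_email.*': ["urn:li:glossaryTerm:Classification.PII"]
                      '.*_id$': ["urn:li:glossaryTerm:Identifier"]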
  • r

    red-action-68363

    02/21/2023, 9:58 AM
    Ingestion Delta Lake error?
    d
    • 2
    • 3
  • n

    numerous-account-62719

    02/21/2023, 12:59 PM
    Hi Team, I'm getting the following error. Can someone help me?
    Copy code
    (urn:li:dataPlatform:postgres,inventory_data.public._v_router_interface_master,PROD)\n'
               '[2023-02-21 12:20:33,827] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'inventory_data.public._v_router_interface_master\n'
               '[2023-02-21 12:20:33,949] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_router_interface_master-subtypes\n'
               '[2023-02-21 12:20:34,027] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_router_interface_master-viewProperties\n'
               '[2023-02-21 12:20:34,119] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_sitedetails,PROD)\n'
               '[2023-02-21 12:20:34,237] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_sitedetails\n'
               '[2023-02-21 12:20:34,366] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitedetails-subtypes\n'
               '[2023-02-21 12:20:34,384] INFO     {datahub.ingestion.source.ge_data_profiler:930} - Finished profiling '
               'inventory_data.public.correlation_groups_master_new; took 66.890 seconds\n'
               '[2023-02-21 12:20:35,687] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitedetails-viewProperties\n'
               '[2023-02-21 12:20:35,742] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_sitelist1,PROD)\n'
               '[2023-02-21 12:20:35,824] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_sitelist1\n'
               '[2023-02-21 12:20:35,898] INFO     {datahub.ingestion.source.ge_data_profiler:930} - Finished profiling '
               'inventory_data.public.bbip_route_details; took 63.394 seconds\n'
               '[2023-02-21 12:20:35,967] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitelist1-subtypes\n'
               '[2023-02-21 12:20:36,087] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitelist1-viewProperties\n'
               '[2023-02-21 12:20:36,154] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_edgelist2,PROD)\n'
               '[2023-02-21 12:20:36,250] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_edgelist2\n'
               '[2023-02-21 12:20:36,377] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_edgelist2-subtypes\n'
               '[2023-02-21 12:20:36,516] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_edgelist2-viewProperties\n'
               '[2023-02-21 12:20:36,618] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_sitetopology2,PROD)\n'
               '[2023-02-21 12:20:36,800] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_sitetopology2\n'
               '[2023-02-21 12:20:36,855] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitetopology2-subtypes\n'
               '[2023-02-21 12:20:36,933] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_sitetopology2-viewProperties\n'
               '[2023-02-21 12:20:37,004] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_topologygraphsites,PROD)\n'
               '[2023-02-21 12:20:37,101] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'inventory_data.public._v_topologygraphsites\n'
               '[2023-02-21 12:20:37,226] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_topologygraphsites-subtypes\n'
               '[2023-02-21 12:20:37,342] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit _v_topologygraphsites-viewProperties\n'
               '[2023-02-21 12:20:37,487] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit '
               'container-urn:li:container:3bd5879a590e50509e9cf6b786d6170e-to-urn:li:dataset:(urn:li:dataPlatform:postgres,inventory_data.public._v_nodetopology2,PROD)\n'
               '[2023-02-21 12:20:37,633] INFO     {datahub.ingestion.run.pipeline:102} - sink wrote workunit inventory_data.public._v_nodetopology2\n'
               '/usr/local/bin/run_ingest.sh: line 26:   695 Killed                  ( python3 -m datahub ingest -c "$4/$1.yml" )\n',
               "2023-02-21 12:20:39.139032 [exec_id=26e8f736-796a-430b-8987-81242e1d53b2] INFO: Failed to execute 'datahub ingest'",
               '2023-02-21 12:20:39.140298 [exec_id=26e8f736-796a-430b-8987-81242e1d53b2] INFO: Caught exception EXECUTING '
               'task_id=26e8f736-796a-430b-8987-81242e1d53b2, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    b
    d
    h
    • 4
    • 13
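    The log above ends with the ingest subprocess being Killed at line 26 of run_ingest.sh, which usually means the OS killed the process, commonly from memory pressure while profiling large tables, rather than a source error. A hedged sketch of one way to keep memory bounded, assuming the standard profiling / profile_pattern options of the SQL sources; the deny pattern is a hypothetical example.
    Copy code
    source:
        type: postgres
        config:
            # ...existing connection settings...
            profiling:
                enabled: false   # or leave enabled and narrow the scope below
            profile_pattern:
                deny:
                    - 'inventory_data\.public\.correlation_groups_.*'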
  • g

    green-lock-62163

    02/21/2023, 5:41 PM
    Hi All, I am struggling with lineage_job_dataflow_new_api_simple.py. All previously created/added lineage is lost when the script gets re-executed. Scenario: launch the script, pick one job in the pipeline, and manually add an additional downstream lineage to another job. Relaunch the script and observe that the manual lineage has vanished.
    ✅ 1
    b
    • 2
    • 18
  • r

    red-waitress-53338

    02/21/2023, 6:09 PM
    Hi, I am trying to ingest metadata from the UI, and I am getting the following error when trying to set up the Actions framework.
    ✅ 1
    r
    o
    • 3
    • 6
  • a

    acceptable-nest-20465

    02/21/2023, 6:17 PM
    Hello All, is it possible to ingest data from Neo4j?
    ✅ 1
    o
    • 2
    • 1
  • w

    white-horse-97256

    02/21/2023, 6:28 PM
    Hi Team, how can we specify the datahub-gms server URL for the Java-based emitter?
    ✅ 1
    r
    a
    o
    • 4
    • 7
  • f

    few-library-66655

    02/21/2023, 6:54 PM
    Hello, team! I am trying to ingest AWS Glue, and I found that the
    aws_access_key_id
    ,
    aws_secret_access_key
    and
    aws_session_token
    will expire.
    Copy code
    ClientError: An error occurred (ExpiredTokenException) when calling the GetDatabases operation: The security token included in the request is expired
    So I have to generate a new token manually every time I run the ingestion. I would like to
    Configure an Ingestion Schedule
    to sync Glue every day; what should I do to automatically pick up the latest AWS tokens, or how can I keep them from expiring?
    c
    • 2
    • 3
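    A hedged sketch for the scheduled-Glue question above: rather than pasting short-lived STS keys into the recipe, the Glue source can be pointed at an IAM role via aws_role (the same field used in the prod recipe a few messages below), so fresh temporary credentials are obtained on every scheduled run. The account id and role name are placeholders, and the host running ingestion must be allowed to assume that role.
    Copy code
    source:
        type: glue
        config:
            aws_region: us-west-2
            aws_role: 'arn:aws:iam::123456789012:role/datahub-glue-ingestion'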
  • w

    white-horse-97256

    02/21/2023, 9:50 PM
    Hi Team, when we create a recipe from the CLI, will DataHub store the source credentials in its database?
    b
    • 2
    • 11
  • f

    few-library-66655

    02/22/2023, 12:13 AM
    I am running Datahub locally and trying to ingest metadata from Glue. Here is my recipe:
    Copy code
    source:
        type: glue
        config:
            aws_region: us-west-2
            aws_role: 'arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME'
    When I run this in prod it succeeds, and I want to test it locally. But it fails locally with the error message
    Copy code
    PipelineInitError: Failed to configure the source (glue): 'NoneType' object has no attribute 'access_key'
    can anyone help me with that?
    a
    • 2
    • 3
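    A hedged sketch for the local failure above: the 'NoneType' object has no attribute 'access_key' error suggests boto3 found no base credentials on the local machine to assume the role with. One way to test locally is to supply the aws_access_key_id / aws_secret_access_key / aws_session_token fields mentioned earlier in the thread (recipes can expand environment variables), while keeping aws_role as in prod; all values below are placeholders.
    Copy code
    source:
        type: glue
        config:
            aws_region: us-west-2
            aws_role: 'arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME'
            aws_access_key_id: '${AWS_ACCESS_KEY_ID}'
            aws_secret_access_key: '${AWS_SECRET_ACCESS_KEY}'
            aws_session_token: '${AWS_SESSION_TOKEN}'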
  • b

    bright-wall-5515

    02/22/2023, 9:18 AM
    Hey everyone 👋, I’m new to DataHub, and we are working with an instance set up by our previous team. Some of the tasks in our DataHub pipeline started failing over the weekend with this error:
    Copy code
    [2023-02-22 00:26:03,368] {{pod_launcher.py:156}} INFO - b' File "pydantic/validators.py", line 715, in find_validators\n'
    [2023-02-22 00:26:03,368] {{pod_launcher.py:156}} INFO - b"RuntimeError: no validator found for <class 're.Pattern'>, see `arbitrary_types_allowed` in Config\n"
    [2023-02-22 00:26:04,425] {{pod_launcher.py:171}} INFO - Event: workspaces-pipeline-5357b4ab2cdd4abf933f2d307a107cf2 had an event of type Failed
    Can someone help me find the source of this problem, or point me to where to look for the error, please? 🙏
    h
    • 2
    • 15
  • a

    ambitious-umbrella-5901

    02/22/2023, 1:29 PM
    Looking to ingest from Mongo Atlas. We use X509 certificates for authentication. I don’t see anything in the documentation for the mongodb module that would allow for this. Any suggestions on how to use certificate based authentication?
    a
    • 2
    • 1
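    A hedged sketch for the Mongo Atlas question above, assuming the mongodb source passes its options block through to pymongo's MongoClient, in which case the standard client options for certificate authentication (tls, tlsCertificateKeyFile, authMechanism: MONGODB-X509) would apply; the URI and paths are placeholders.
    Copy code
    source:
        type: mongodb
        config:
            connect_uri: 'mongodb+srv://cluster0.example.mongodb.net'
            options:
                tls: true
                tlsCertificateKeyFile: '/path/to/client-cert-and-key.pem'
                authMechanism: 'MONGODB-X509'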
  • b

    bitter-evening-61050

    02/22/2023, 3:03 PM
    Hi Team, I am trying to integrate the Microsoft Teams action with DataHub. I get this error when I try to install the Teams action with pip install datahub-actions[teams]: ERROR: Could not find a version that satisfies the requirement datahub-actions[teams] (from versions: none) ERROR: No matching distribution found for datahub-actions[teams]. DataHub version = 0.9.5
    a
    o
    • 3
    • 2
  • g

    great-flag-53653

    02/22/2023, 5:14 PM
    Hi! I'm testing DataHub locally. I started yesterday by ingesting metadata from our BigQuery source, which generated table lineage correctly. Today I added our dbt source to the recipe and ran the ingestion again. For some tables I now see the dbt icon as well as the BigQuery icon in the graphical lineage, but for others (in this case a view on top of a table) I see duplicated datasets/nodes in the lineage. Do I need to remove everything from DataHub and run my ingestion again now that I have both dbt and BigQuery in my recipe? Also, the correct case seems to show on the view nodes (the two on the right) but not on the tables: all of these tables and views are "PascalCase", which shows correctly on the views, and for the others I only see the right casing if I click a node in the lineage graph and look at the name in the window that pops up. Example:
    a
    h
    • 3
    • 13
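    A hedged sketch for the duplicate-node question above: dbt and BigQuery entities only collapse into one node when both sources emit identical dataset URNs, so the dbt source needs target_platform: bigquery and the env and name casing of the two sources have to line up (warehouse URNs are often lower-cased, which would explain the PascalCase views showing up as separate nodes). Exact casing options vary by CLI version; paths are placeholders.
    Copy code
    source:
        type: dbt
        config:
            manifest_path: './target/manifest.json'
            catalog_path: './target/catalog.json'
            target_platform: bigquery   # must match the warehouse source's platform
            env: PROD                   # and its environment, so the URNs line up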
  • a

    acoustic-quill-54426

    02/22/2023, 6:34 PM
    Hi! We are using the advanced use case of transformers to add multiple aspects to some datasets depending on business logic. After upgrading the bigquery source connector, we seem to receive
    MetadataChangeProposal
    instead of
    MetadataChangeEvent
    in our
    transform
    . Before, we could add aspects to the `proposedSnapshot.aspects` list. Do you have any tips on what we can do now? Maybe yield a new RecordEnvelope with an MCP?
    ✅ 1
    m
    • 2
    • 6
  • w

    white-horse-97256

    02/22/2023, 8:14 PM
    Hi Team, can we create a DataFlow from the CLI?
    h
    • 2
    • 3
  • w

    white-horse-97256

    02/22/2023, 10:54 PM
    Hi Team, another question: what is the difference between
    MetadataChangeProposalWrapper
    and
    UpsertAspectRequest?
    h
    • 2
    • 1
  • s

    silly-intern-25190

    02/23/2023, 2:19 AM
    Hi team, hope everyone is doing well. I am facing a weird issue while ingesting from a Vertica database and don't know if it comes from DataHub or SQLAlchemy: whenever ingestion starts, DataHub creates 5 connections to the database but uses only 1 connection to execute the query. Also, if max_threads is set to, say, 5, it creates 25 connections to the database but still uses 1 connection to execute the query. It would be helpful if someone could point me towards documentation on how connections to the database are managed.
    h
    b
    • 3
    • 9
  • r

    rich-daybreak-77194

    02/23/2023, 5:17 AM
    Hi everyone, I’m integrating Great Expectations with DataHub. When I run the checkpoint file I get the error “great_expectations.exceptions.exceptions.InvalidDataContextConfigError: Error while processing DataContextConfig: datasources”. How can I solve it?
    great_expectations.yml, new_checkpoint
    h
    • 2
    • 6
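    A hedged note on the Great Expectations error above: InvalidDataContextConfigError ... datasources points at the datasources block of great_expectations.yml itself rather than at the DataHub integration, so that file is the first thing to fix. For reference, a sketch of how the DataHub action is typically added to a checkpoint's action_list once the context loads, assuming the documented DataHubValidationAction; the GMS URL is a placeholder.
    Copy code
    action_list:
        - name: datahub_action
          action:
              module_name: datahub.integrations.great_expectations.action
              class_name: DataHubValidationAction
              server_url: 'http://datahub-gms-host:8080'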
  • b

    best-notebook-58252

    02/23/2023, 7:36 AM
    Hi all. I’m trying to run a
    lookml
    ingestion with the CLI, but it’s stuck because my SSH key requires a passphrase. Is it possible to pass it in? I was trying something like
    echo $PASSWORD | datahub ingest -c lookml.yaml
    but it doesn’t work
    h
    • 2
    • 1
  • q

    quiet-jelly-11365

    02/23/2023, 9:17 AM
    Hi team, when I try to check plugins on my machine (Mac M1 chip), I get the error below. Is anyone else facing the same issue?
    Copy code
    datahub check plugins
    Sources:
    [2023-02-22 17:52:56,159] ERROR    {datahub.entrypoints:225} - Command failed: code() argument 13 must be str, not int
    o
    • 2
    • 2
  • l

    lemon-scooter-69730

    02/23/2023, 10:41 AM
    Is it possible to ingest dbt manifest/catalogue data from a Google Cloud Storage bucket?
    a
    • 2
    • 2
  • r

    red-easter-85320

    02/23/2023, 10:56 AM
    What's the account/password for MySQL in the DataHub Docker deployment? #ingestion
    ✅ 1
    b
    • 2
    • 2
  • c

    creamy-portugal-88620

    02/23/2023, 12:33 PM
    Hi Team, I ingested metadata from Athena but am getting the error below.
    d
    • 2
    • 2
  • c

    creamy-portugal-88620

    02/23/2023, 12:33 PM
    Copy code
    2023-02-23 12:22:16,182 INFO sqlalchemy.engine.Engine [raw sql] {}
    [2023-02-23 12:22:16,182] INFO     {sqlalchemy.engine.Engine:1858} - [raw sql] {}
    [2023-02-23 12:22:17,543] WARNING  {datahub.ingestion.source.sql.sql_common:643} - Unable to ingest sampledb.temp_batch_dlq due to an exception.
     Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-athena-0.10.0/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 639, in loop_tables
        yield from self._process_table(
      File "/tmp/datahub/ingest/venv-athena-0.10.0/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 738, in _process_table
        yield from self.add_table_to_schema_container(
      File "/tmp/datahub/ingest/venv-athena-0.10.0/lib/python3.10/site-packages/datahub/ingestion/source/sql/athena.py", line 238, in add_table_to_schema_container
        parent_container_key=self.get_database_container_key(db_name, schema),
      File "/tmp/datahub/ingest/venv-athena-0.10.0/lib/python3.10/site-packages/datahub/ingestion/source/sql/athena.py", line 220, in get_database_container_key
        assert db_name == schema
    AssertionError
  • b

    bitter-furniture-95993

    02/23/2023, 2:08 PM
    Hello, I am trying to configure Airflow to send lineage to DataHub using Kafka. Airflow and DataHub are on different servers. I am able to send data successfully using the REST connection, but the Kafka connection fails. It seems to be looking for (host='localhost', port=8081), so I guess a Kafka schema registry parameter is missing, but I can't find how to add it. I verified that the network ports are open on both sides. Here is the error I get:
    [2023-02-23, 13:36:35 UTC] {logging_mixin.py:137} INFO - Exception: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn (self._dns_host, self.port), self.timeout, **extra_kw File "/home/debian/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection raise err File "/home/debian/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection sock.connect(sa) ConnectionRefusedError: [Errno 111] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen chunked=chunked, File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request conn.request(method, url, **httplib_request_kw) File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request super(HTTPConnection, self).request(method, url, body=body, headers=headers) File "/usr/lib/python3.7/http/client.py", line 1260, in request self._send_request(method, url, body, headers, encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1306, in _send_request self.endheaders(body, encode_chunked=encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1255, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) File "/usr/lib/python3.7/http/client.py", line 1030, in _send_output self.send(msg) File "/usr/lib/python3.7/http/client.py", line 970, in send self.connect() File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect conn = self._new_conn() File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn self, "Failed to establish a new connection: %s" % e urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fce2cc7e828>: Failed to establish a new connection: [Errno 111] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/requests/adapters.py", line 499, in send timeout=timeout, File "/home/debian/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2] File "/home/debian/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment raise MaxRetryError(_pool, url, error or ResponseError(cause)) urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /subjects/MetadataChangeProposal_v1-value/versions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce2cc7e828>: Failed to establish a new connection: [Errno 111] Connection refused')) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/serializing_producer.py", line 172, in produce value = self._value_serializer(value, ctx) File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/schema_registry/avro.py", line 251, in __call__ self._schema) File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/schema_registry/schema_registry_client.py", line 338, in register_schema body=request) File 
"/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/schema_registry/schema_registry_client.py", line 127, in post return self.send_request(url, method='POST', body=body) File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/schema_registry/schema_registry_client.py", line 169, in send_request headers=headers, data=body, params=query) File "/home/debian/.local/lib/python3.7/site-packages/requests/sessions.py", line 587, in request resp = self.send(prep, **send_kwargs) File "/home/debian/.local/lib/python3.7/site-packages/requests/sessions.py", line 701, in send r = adapter.send(request, **kwargs) File "/home/debian/.local/lib/python3.7/site-packages/requests/adapters.py", line 565, in send raise ConnectionError(e, request=request) requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /subjects/MetadataChangeProposal_v1-value/versions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce2cc7e828>: Failed to establish a new connection: [Errno 111] Connection refused')) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/debian/.local/lib/python3.7/site-packages/datahub_provider/_plugin.py", line 281, in custom_on_success_callback datahub_task_status_callback(context, status=InstanceRunResult.SUCCESS) File "/home/debian/.local/lib/python3.7/site-packages/datahub_provider/_plugin.py", line 145, in datahub_task_status_callback dataflow.emit(emitter, callback=_make_emit_callback(task.log)) File "/home/debian/.local/lib/python3.7/site-packages/datahub/api/entities/datajob/dataflow.py", line 140, in emit emitter.emit(mcp, callback) File "/home/debian/.local/lib/python3.7/site-packages/datahub/emitter/kafka_emitter.py", line 119, in emit return self.emit_mcp_async(item, callback or _error_reporting_callback) File "/home/debian/.local/lib/python3.7/site-packages/datahub/emitter/kafka_emitter.py", line 150, in emit_mcp_async on_delivery=callback, File "/home/debian/.local/lib/python3.7/site-packages/confluent_kafka/serializing_producer.py", line 174, in produce raise ValueSerializationError(se) confluent_kafka.error.ValueSerializationError: KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /subjects/MetadataChangeProposal_v1-value/versions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce2cc7e828>: Failed to establish a new connection: [Errno 111] Connection refused'))"} [2023-02-23, 13:36:35 UTC] {local_task_job.py:159} INFO - Task exited with return code 0 [2023-02-23, 13:36:35 UTC] {taskinstance.py:2582} INFO - 0 downstream tasks scheduled from follow-on schedule check
  • s

    straight-laptop-6275

    02/23/2023, 2:42 PM
    Hi, I want to look into the MySQL database where the metadata events are stored.
    ✅ 1
  • s

    straight-laptop-6275

    02/23/2023, 2:42 PM
    Is someone aware of the credentials?
  • s

    straight-laptop-6275

    02/23/2023, 2:52 PM
    I mean the credentials of prerequisite-mysql
    b
    • 2
    • 1
  • l

    lemon-scooter-69730

    02/23/2023, 5:03 PM
    How do you clear already ingested datasets?
    a
    b
    • 3
    • 6