# troubleshoot
  • q

    quaint-barista-82836

    01/23/2023, 5:52 PM
    Hi Team, I got the below message for the BQ ingestion pipeline run; I ran this with standard parameters with table profiling enabled:
    Copy code
    '[2023-01-23 17:42:42,108] WARNING  {py.warnings:109} - '
               '/tmp/datahub/ingest/venv-bigquery-0.9.6/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py:937: '
               'DeprecationWarning: Call to deprecated function (or staticmethod) wrap_aspect_as_workunit. (use '
               'MetadataChangeProposalWrapper(...).as_workunit() instead)\n'
               '  wu = wrap_aspect_as_workunit(\n'
               '\n'
               '[2023-01-23 17:42:42,110] WARNING  {py.warnings:109} - '
               '/tmp/datahub/ingest/venv-bigquery-0.9.6/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py:957: '
               'DeprecationWarning: Call to deprecated function (or staticmethod) wrap_aspect_as_workunit. (use '
               'MetadataChangeProposalWrapper(...).as_workunit() instead)\n'
               '  wu = wrap_aspect_as_workunit("dataset", dataset_urn, "subTypes", subTypes)\n'
               '\n'
               '[2023-01-23 17:42:42,190] DEBUG    {datahub.emitter.rest_emitter:250} - Attempting to emit to DataHub GMS; using curl equivalent to:\n',
               '2023-01-23 17:42:42.336687 [exec_id=96401624-f6b0-46e7-98c9-836345181165] INFO: Caught exception EXECUTING '
               'task_id=96401624-f6b0-46e7-98c9-836345181165, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 525, in readline\n'
               '    line = await self.readuntil(sep)\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 620, in readuntil\n'
               '    raise exceptions.LimitOverrunError(\n'
               'asyncio.exceptions.LimitOverrunError: Separator is found, but chunk is longer than limit\n'
               '\n'
               'During handling of the above exception, another exception occurred:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 147, in execute\n'
               '    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 99, in _read_output_lines\n'
               '    line_bytes = await ingest_process.stdout.readline()\n'
               '  File "/usr/local/lib/python3.10/asyncio/streams.py", line 534, in readline\n'
               '    raise ValueError(e.args[0])\n'
               'ValueError: Separator is found, but chunk is longer than limit\n']}
    Execution finished with errors.
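    For context on the error above: asyncio's `StreamReader.readline()` raises `LimitOverrunError` (re-raised as `ValueError`) whenever a single line from the subprocess is longer than the reader's buffer limit (64 KiB by default), which can happen when the ingestion subprocess emits a very large single-line log entry. A minimal stdlib-only reproduction, unrelated to any DataHub code, just to illustrate the mechanism:
    ```python
    import asyncio


    async def main() -> None:
        # Spawn a child process that prints one very long line (~100 KB).
        proc = await asyncio.create_subprocess_exec(
            "python3", "-c", "print('x' * 100_000)",
            stdout=asyncio.subprocess.PIPE,
            # limit=1_000_000,  # a larger StreamReader limit would avoid the error
        )
        assert proc.stdout is not None
        try:
            line = await proc.stdout.readline()
            print(f"read {len(line)} bytes")
        except ValueError as exc:
            # With the default 64 KiB limit this typically fails with a message like
            # "Separator is found, but chunk is longer than limit".
            print(f"readline failed: {exc}")
        finally:
            await proc.wait()


    asyncio.run(main())
    ```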
  • c

    cool-fireman-87485

    01/23/2023, 5:58 PM
    Hi all! Using the UI I tried to create some lineage through assets and it works perfectly. Now that I want to modify the lineage I created, I realize that it is impossible to delete the relationships. I think it is a real bug... in fact, when I remove the upstream/downstream a pop-up "Lineage updated!" appears, but after reloading the UI page the relation is still there... Has anyone experienced this?
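    If the UI won't drop the edge, one possible workaround (a sketch, assuming the UI edit ends up in the dataset's upstreamLineage aspect; the URNs and the GMS address are placeholders) is to overwrite that aspect programmatically so it only contains the upstreams you want to keep:
    ```python
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    # Placeholder URNs: the downstream dataset and the upstreams that should remain.
    downstream = make_dataset_urn(platform="hive", name="db.downstream_table", env="PROD")
    keep = [
        UpstreamClass(
            dataset=make_dataset_urn(platform="hive", name="db.upstream_table", env="PROD"),
            type=DatasetLineageTypeClass.TRANSFORMED,
        )
    ]  # an empty list removes all upstream edges

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=downstream,
        aspectName="upstreamLineage",
        aspect=UpstreamLineageClass(upstreams=keep),
    )

    # Placeholder GMS address; an upsert replaces the aspect as a whole.
    DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(mcp)
    ```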
  • q

    quaint-barista-82836

    01/23/2023, 10:00 PM
    Hi Team, at multiple stages I am getting the below error when ingesting BigQuery metadata from the CLI:
    Copy code
    Does your service account has bigquery.tables.list, bigquery.routines.get, bigquery.routines.list permission, bigquery.tables.getData permission? The error was: 'type'
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO - Traceback (most recent call last):
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -   File "/tmp/venv45wzxte5/lib/python3.8/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 587, in _process_project
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -     yield from self._process_schema(conn, project_id, bigquery_dataset)
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -   File "/tmp/venv45wzxte5/lib/python3.8/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 702, in _process_schema
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -     yield from self._process_table(conn, table, project_id, dataset_name)
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -   File "/tmp/venv45wzxte5/lib/python3.8/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 735, in _process_table
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -     yield from self.gen_table_dataset_workunits(table, project_id, schema_name)
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -   File "/tmp/venv45wzxte5/lib/python3.8/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 774, in gen_table_dataset_workunits
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -     custom_properties["time_partitioning"] = str(table.time_partitioning)
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -   File "/tmp/venv45wzxte5/lib/python3.8/site-packages/google/cloud/bigquery/table.py", line 2689, in __repr__
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -     key_vals = ["{}={}".format(key, val) for key, val in self._key()]
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -   File "/tmp/venv45wzxte5/lib/python3.8/site-packages/google/cloud/bigquery/table.py", line 2665, in _key
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO -     properties["type_"] = repr(properties.pop("type"))
    [2023-01-23, 21:55:49 UTC] {process_utils.py:168} INFO - KeyError: 'type'
    The service account has access granted as per https://datahubproject.io/docs/quick-ingestion-guides/bigquery/setup/ and I am on v0.9.6.1.
  • l

    limited-library-89060

    01/24/2023, 2:26 AM
    Hi team, we want to integrate our Great Expectations results into DataHub. Previously we got these errors:
    Copy code
    Datasource test_datasource is not present in platform_instance_map
    argument of type 'NoneType' is not iterable
    After we added the datasource to the platform_instance_map in the payload, the first error no longer appears, but the second one is still there. We are using custom queries to create a dataset test, and use
    expect_table_row_count_to_equal
    to check whether it passes. Any help would be appreciated.
  • f

    flat-table-17463

    01/24/2023, 6:54 AM
    Hi all, we want to set table descriptions when importing metadata by using transformers. However, we could not get the table descriptions to work using custom transformers as mentioned in the documentation. How can we do this?
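    For what it's worth, here is a rough outline of a description-setting transformer, following the custom-transformer pattern from the docs. The base-class hooks used here (`entity_types`, `aspect_name`, `transform_aspect`, `create`) and the config shape are assumptions to verify against the transformer docs for your acryl-datahub version, and the urn-to-description map is purely hypothetical:
    ```python
    from typing import Dict, List, Optional

    from datahub.configuration.common import ConfigModel
    from datahub.emitter.mce_builder import Aspect
    from datahub.ingestion.api.common import PipelineContext
    from datahub.ingestion.transformer.base_transformer import (
        BaseTransformer,
        SingleAspectTransformer,
    )
    from datahub.metadata.schema_classes import DatasetPropertiesClass


    class SetDescriptionConfig(ConfigModel):
        # Hypothetical config: map of dataset urn -> description text.
        descriptions: Dict[str, str] = {}


    class SetDatasetDescription(BaseTransformer, SingleAspectTransformer):
        """Sketch: writes a description into each dataset's datasetProperties aspect."""

        def __init__(self, config: SetDescriptionConfig, ctx: PipelineContext):
            super().__init__()
            self.config = config
            self.ctx = ctx

        @classmethod
        def create(cls, config_dict: dict, ctx: PipelineContext) -> "SetDatasetDescription":
            return cls(SetDescriptionConfig.parse_obj(config_dict), ctx)

        def entity_types(self) -> List[str]:
            return ["dataset"]

        def aspect_name(self) -> str:
            return "datasetProperties"

        def transform_aspect(
            self, entity_urn: str, aspect_name: str, aspect: Optional[Aspect]
        ) -> Optional[Aspect]:
            description = self.config.descriptions.get(entity_urn)
            if description is None:
                return aspect  # leave datasets we have no description for untouched
            properties = aspect or DatasetPropertiesClass()
            properties.description = description
            return properties
    ```
    The class would then be referenced from the recipe's transformers section by its fully qualified module path, as described in the custom transformer guide.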
  • g

    gray-ocean-32209

    01/24/2023, 7:18 AM
    We are seeing ‘Unauthorized’
    Sorry, you are not authorized to access this page.
    on all assets after upgrading to 0.9.5. All content appears to be inaccessible with an “Unauthorized” message; even the admin user is not able to access any entities. We use OIDC for authentication. When we try to look at the policies at
    <datahub-url>/policies
    we only get a
    Copy code
    Unauthorized to perform this action. Please contact your DataHub administrator. (code 403)
    It was all working fine before the upgrade
  • b

    bland-balloon-48379

    01/24/2023, 4:53 PM
    Hey everyone! Lately my team has been seeing some issues in one of our datahub environments: it appears data is not being pushed to our graph database (neo4j community edition) when new items are ingested. The main example I have is the UpstreamLineage aspect. When ingesting a set of these aspects, we're seeing the data show up in mysql, but not in neo4j. Additionally, when we hard delete the entity from datahub using the CLI, it is removed from mysql but not from neo4j. However, the connection between the gms service and neo4j seems to be working fine for standard queries, because whatever data is present in neo4j is visible in the frontend UI. The following are the steps and results from identifying and debugging this issue, to give you all a timeline:
    1. Ingested new dataset entities. They appeared in mysql, neo4j, and the UI.
    2. Ingested lineage data for these new datasets. All of the lineage appeared in mysql, but only a subset of the lineage appeared in neo4j & the UI (seemingly all oracle tables).
    3. Reindexed a single urn for a downstream dataset. The DownstreamOf relationship now appears in neo4j for the reindexed dataset, and the correct lineage is shown in the UI.
    4. Ran the RestoreIndices kubernetes job for all aspects. The job ran for ~9 hours and completed successfully, however no new relationships appeared in neo4j or the UI.
    5. Restarted neo4j, no effect.
    6. Manually added one of the missing edges to neo4j. The correct lineage then appeared in the UI.
    7. Did a hard delete on one of the dataset entities. The dataset was deleted from mysql and elasticsearch and was no longer present in the UI, however the node and relationships were still present in neo4j.
    8. From this point on we switched over to the kafka emitter, as the rest emitter was seemingly related to similar problems in the past.
    9. Reingested the deleted dataset. It reappeared in the UI with the partial lineage info it had before being deleted.
    10. Manually deleted the lineage relationships from neo4j for that dataset and reingested the UpstreamLineage aspect. The aspect appeared in mysql, but the relationships were not recreated in neo4j or the UI.
    11. Tried several combinations of restarting datahub, restarting neo4j, reindexing, and reingesting. No effect.
    12. We've also seen some validation aspects be created in mysql after ingestion but not appear in the UI.
    We've seen an issue like this pop up in the past that appeared to be related to the REST sink. The REST sink was used for the first seven steps of this timeline, but we have switched to the kafka emitter now. When similar issues occurred in the past we were able to resolve them by reindexing the database and restarting our graph db a few times, but that does not appear to be working here. If anyone has any thoughts or ideas regarding directions to move in on this issue, I'd love to hear them. Thanks in advance!
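    One way to narrow down whether the write path (GMS to neo4j) or the read path is at fault is to query neo4j directly for the edges that should exist. Below is a small sketch with the official neo4j Python driver; the bolt URL, credentials, and URN are placeholders, and the assumption that nodes carry a `urn` property with `DownstreamOf` edges pointing from the downstream dataset to its upstream should be verified against your own graph:
    ```python
    from neo4j import GraphDatabase

    URI = "bolt://localhost:7687"   # placeholder
    AUTH = ("neo4j", "datahub")     # placeholder credentials
    DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:oracle,my_db.my_table,PROD)"  # placeholder

    # Assumed model: (downstream dataset)-[:DownstreamOf]->(upstream dataset),
    # with each node keyed by a `urn` property.
    CYPHER = """
    MATCH (d {urn: $urn})-[:DownstreamOf]->(u)
    RETURN u.urn AS upstream_urn
    """

    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            upstreams = [record["upstream_urn"] for record in session.run(CYPHER, urn=DATASET_URN)]

    # An empty list here while the upstreamLineage aspect exists in mysql points
    # at the GMS -> neo4j write path rather than the UI/read path.
    print(upstreams)
    ```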
  • a

    able-evening-90828

    01/24/2023, 10:43 PM
    What is the best way to retrieve a list of child glossary terms under a glossary node using GraphQL? The following query didn't work:
    Copy code
    query childGlossaryTerms {
      searchAcrossEntities(input: {
        types: [GLOSSARY_TERM], 
        query: "",
        orFilters: {
          and: {
            field: "parentNodes",
            values: ["urn:li:glossaryNode:data-type"],
          }
        }
      }) {
        searchResults {
          entity {
            urn
            type
          }
        }
      }
    }
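    In case the search-filter route keeps returning nothing, here is a hedged alternative sketch: walk the node's incoming relationships instead. This assumes the `IsPartOf` relationship name used by the glossary model and a placeholder GMS address, and goes through the Python graph client:
    ```python
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # placeholder

    QUERY = """
    query childTerms($urn: String!) {
      glossaryNode(urn: $urn) {
        children: relationships(
          input: { types: ["IsPartOf"], direction: INCOMING, start: 0, count: 100 }
        ) {
          total
          relationships {
            entity {
              urn
              type
            }
          }
        }
      }
    }
    """

    result = graph.execute_graphql(QUERY, variables={"urn": "urn:li:glossaryNode:data-type"})
    for rel in result["glossaryNode"]["children"]["relationships"]:
        print(rel["entity"]["type"], rel["entity"]["urn"])
    ```
    Note this may return child glossary nodes as well as terms, so filter on the returned type if only terms are needed.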
  • b

    best-wire-59738

    01/25/2023, 6:00 AM
    Hello Team, I have a small doubt. We have implemented a custom authenticator plugin in such a way that it returns a different user URN for users belonging to different domains after authentication, so that a user cannot change datasets that belong to another domain. This works fine for the GraphQL API, but when a user hits OpenAPI to add or delete a dataset, they are able to do it without any domain restriction. I would like to know: are policies not considered when we use OpenAPI?
  • a

    average-dinner-25106

    01/25/2023, 7:07 AM
    Hi, I am trying to upload images to the documentation. However, as the screenshot shows, the image stored in DataHub does not appear. What's the problem? FYI, I ran datahub quickstart.
  • b

    brief-ability-41819

    01/25/2023, 10:31 AM
    Hello, Is it possible that DataHub uses two versions of entities in API calls? When I run commands via CURL it works properly:
    Copy code
    curl -X 'GET' 'https://DATAHUB_URL/openapi/entities/v1/latest?urns=MY_URN' -H 'accept: application/json' --header 'Authorization: Bearer MY_TOKEN' | jq
    but when I’m trying to access the same data with:
    Copy code
    datahub --debug get --urn "urn:li:dataset:(MY_URN)" --aspect ownership
    it throws 404:
    404 Client Error: Not Found for url: <https://DATAHUB_URL/openapi/entitiesV2/MY_URN?aspects=List(ownership)>
    SwaggerUI shows only
    /entities/v1
    and my suspicion is that it tries to reach
    /entities/v2
    via CLI - is there any flag to set it?
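    While the CLI question gets sorted out, one alternative sketch for reading the ownership aspect is the Python graph client pointed at GMS; the server, token, and URN below are placeholders mirroring the ones above:
    ```python
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

    # Point this at GMS, not the frontend; values are placeholders.
    graph = DataHubGraph(DatahubClientConfig(server="https://DATAHUB_URL", token="MY_TOKEN"))

    ownership = graph.get_ownership(entity_urn="urn:li:dataset:(MY_URN)")
    if ownership is None:
        print("no ownership aspect found")
    else:
        for owner in ownership.owners:
            print(owner.owner, owner.type)
    ```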
  • e

    elegant-salesmen-99143

    01/25/2023, 3:37 PM
    I have a problem with stateful ingestion. It wasn’t enabled when we initially ingested the datasource (Hive). I enabled it now, but DataHub still displays tables that are long gone. It says ‘Last synchronized 4 months ago’ next to them, so we know that’s when they last existed, but it still doesn’t soft-delete them :( What can I do to clean up all the old deleted tables? I’m on 0.9.6.1 and my ingest recipe looks like this:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: '***'
    source:
        type: hive
        config:
            host_port: '***:10000'
            env: PROD
            username: ***
            include_tables: true
            include_views: true
            stateful_ingestion:
                enabled: true
                remove_stale_metadata: true
    transformers:
        -
            type: set_dataset_browse_path
            config:
                replace_existing: true
                path_templates:
                    - /ENV/PLATFORM/DATASET_PARTS
    pipeline_name: 'urn:li:dataHubIngestionSource:***'
  • a

    acceptable-restaurant-2734

    01/25/2023, 7:51 PM
    Silly question, but if I'm running ingestion through the CLI using Docker with localhost:8080 as the sink, why can I not see the metadata I ingested from BQ in the UI?
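    One thing worth checking, stated as a hedge since the setup isn't visible here: in the quickstart compose file, localhost:8080 is GMS (the REST sink target), while the browsable UI is served by datahub-frontend on localhost:9002, so the ingested BigQuery metadata would be searched for at http://localhost:9002. A quick sanity check:
    ```python
    import requests

    # GMS (the REST sink target in the recipe) answers on 8080 in quickstart.
    gms = requests.get("http://localhost:8080/config", timeout=10)
    print("GMS:", gms.status_code, gms.headers.get("content-type"))

    # The UI itself is served by the datahub-frontend container on 9002.
    ui = requests.get("http://localhost:9002", timeout=10)
    print("Frontend:", ui.status_code)
    ```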
  • h

    helpful-fish-88957

    01/25/2023, 8:22 PM
    Hi all, quickstart started failing for me yesterday, with the following error:
    Copy code
    Unable to run quickstart - the following issues were detected:
    - kafka-setup container is not present
    I suspect it's related to the changes in this PR: https://github.com/datahub-project/datahub/pull/7073 based on the timing and the fact that it has to do with kafka/quickstart -- but I'm pretty new to DataHub, so advice on how to proceed would be appreciated. Thanks!
  • f

    faint-hair-91313

    01/26/2023, 8:17 AM
    Dear all, sometimes we see slight delays (up to 5 seconds) in getting everything to load in the UI, or when navigating through Datasets. It does not always happen; sometimes it is instant. Is there a way to improve performance by allocating more resources to the containers, etc.?
  • e

    early-student-2446

    01/26/2023, 10:28 AM
    Hi all, I would like to test my DataHub SQL backup. Prior to starting a restore process I was trying to follow this, but I’m getting:
    Copy code
    error: unknown object type *v1beta1.CronJob
    I’m currently using k8s version:
    Copy code
    Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.14", GitCommit:"89182bdd065fbcaffefec691908a739d161efc03", GitTreeState:"clean", BuildDate:"2020-12-18T12:02:35Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
    Are you familiar with this?
  • e

    echoing-needle-51090

    01/26/2023, 1:48 PM
    Hi all, I would like to know if there is any way to reduce RAM usage. I have just run a single ingestion pipeline and it consumed around 300 MiB, which I consider too much.
  • a

    ancient-kite-60433

    01/26/2023, 2:12 PM
    Hi all, we've been running DataHub for 14 days using docker quickstart, but today our DataHub front end home page started showing a big red error message:
    Oops, an error occurred. This exception has been logged with id xxxxxxxx
    (No login page is shown, only the error message.) We have restarted the quickstart container, have also rebooted the VM, and have followed the advice in https://datahubproject.io/docs/debugging/#how-can-i-confirm-if-all-docker-containers-are-running-as-expected-after-a-quickstart
    • datahub docker check
    returned that everything was OK
    • docker logs datahub-frontend-react
    returned the following errors:
    Copy code
    play.api.UnexpectedException: Unexpected exception[ServerResultException: HTTP 1.0 client does not support chunked response]
            at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:358)
            at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:264)
            at play.core.server.common.ServerResultUtils.validateResult(ServerResultUtils.scala:69)
            at play.core.server.akkahttp.AkkaModelConversion.$anonfun$convertResult$1(AkkaModelConversion.scala:193)
            at play.core.server.common.ServerResultUtils.resultConversionWithErrorHandling(ServerResultUtils.scala:195)
            at play.core.server.akkahttp.AkkaModelConversion.convertResult(AkkaModelConversion.scala:215)
            at play.core.server.AkkaHttpServer.$anonfun$runAction$5(AkkaHttpServer.scala:440)
            at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
            at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
            at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63)
            at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:100)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
            at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
            at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:100)
            at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
            at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
            at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
            at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
            at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
            at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
    Caused by: play.core.server.common.ServerResultException: HTTP 1.0 client does not support chunked response
            at play.core.server.common.ServerResultUtils.validateResult(ServerResultUtils.scala:68)
            ... 19 common frames omitted
    2023-01-26 13:44:29,799 [application-akka.actor.default-dispatcher-19] ERROR p.api.http.DefaultHttpErrorHandler -
    ! @80d92mm8g - Internal server error, for (GET) [/] ->
    
    play.api.UnexpectedException: Unexpected exception[ServerResultException: HTTP 1.0 client does not support chunked response]
            at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:358)
            at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:264)
            at play.core.server.common.ServerResultUtils.validateResult(ServerResultUtils.scala:69)
            at play.core.server.akkahttp.AkkaModelConversion.$anonfun$convertResult$1(AkkaModelConversion.scala:193)
            at play.core.server.common.ServerResultUtils.resultConversionWithErrorHandling(ServerResultUtils.scala:195)
            at play.core.server.akkahttp.AkkaModelConversion.convertResult(AkkaModelConversion.scala:215)
            at play.core.server.AkkaHttpServer.$anonfun$runAction$5(AkkaHttpServer.scala:440)
            at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
            at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
            at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
            at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63)
            at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:100)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
            at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
            at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:100)
            at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49)
            at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
            at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
            at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
            at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
            at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
            at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
    Caused by: play.core.server.common.ServerResultException: HTTP 1.0 client does not support chunked response
            at play.core.server.common.ServerResultUtils.validateResult(ServerResultUtils.scala:68)
            ... 19 common frames omitted
    2023-01-26 13:44:29,863 [application-akka.actor.default-dispatcher-19] ERROR p.api.http.DefaultHttpErrorHandler -
    ! @80d92mmb1 - Internal server error, for (GET) [/favicon.ico] ->
    Would greatly appreciate any suggestions. Thanks!
  • b

    bland-orange-13353

    01/26/2023, 2:13 PM
    If you’re having trouble with quickstart, please make sure you’re using the most up-to-date version of DataHub by following the steps in the quickstart deployment guide: https://datahubproject.io/docs/quickstart/#deploying-datahub. Specifically, ensure you’re up to date with the DataHub CLI:
    Copy code
    python3 -m pip install --upgrade pip wheel setuptools
    python3 -m pip install --upgrade acryl-datahub
    datahub version
  • r

    rhythmic-quill-75064

    01/26/2023, 2:33 PM
    Hello team. The transition from version 0.2.105 to version 0.2.106 fails. The datahub-elasticsearch-setup-job is failing, here is the log:
    Copy code
    2023/01/26 14:22:48 Waiting for: http://elasticsearch-master:9200
    Going to use protocol: http
    Going to use default elastic headers
    Create datahub_usage_event if needed against Elasticsearch at elasticsearch-master:9200
    Going to use index prefix::
    2023/01/26 14:22:48 Received 200 from http://elasticsearch-master:9200
    Policy GET response code is
    Got response code  while creating policy so exiting.
    curl: option -k http://elasticsearch-master:9200/_ilm/policy/datahub_usage_event_policy: is unknown
    curl: try 'curl --help' or 'curl --manual' for more information
    /create-indices.sh: line 41: [: -eq: unary operator expected
    /create-indices.sh: line 45: [: -eq: unary operator expected
    /create-indices.sh: line 47: [: -eq: unary operator expected
    2023/01/26 14:22:48 Command exited with error: exit status 1
    Any ideas?
  • a

    aloof-father-61672

    01/26/2023, 2:47 PM
    Hello everyone. I'm attempting to generate a list of "pipeline" URNs, but I receive no results. My script works fine with
    dataset
    entities but not with `dataflow`/`datajob` entities. Is this a bug? I even tried making use of
    datahub.cli.cli_utils.get_urns_by_filter
    (see https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/cli/cli_utils.py) with the same output. I also tried the entity types dataFlow/dataJob; the number of entities returned is zero. URL: DataHub GMS host +
    /entities?action=search
    Payload:
    Copy code
    {
      "input": "*",
      "entity": "dataflow",
      "start": 0,
      "count": 100,
      "filter": {
        "or": [
          {
            "and": [
              {
                "field": "origin",
                "value": "DEV",
                "condition": "EQUAL"
              },
              {
                "field": "platform",
                "value": "urn:li:dataPlatform:my-platform",
                "condition": "EQUAL"
              }
            ]
          }
        ]
      }
    }
    Response
    Copy code
    {
      "value": {
        "numEntities": 0,
        "pageSize": 100,
        "from": 0,
        "metadata": {
          "aggregations": [
            {
              "name": "origin",
              "filterValues": [],
              "aggregations": {},
              "displayName": "origin"
            },
            {
              "name": "platform",
              "filterValues": [],
              "aggregations": {},
              "displayName": "Platform"
            }
          ]
        },
        "entities": []
      }
    }
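    As a cross-check that the dataFlow/dataJob entities are searchable at all, here is a sketch that lists them through the GraphQL `searchAcrossEntities` endpoint instead of the Rest.li `/entities?action=search` call (the server URL is a placeholder); if this returns results, the filter fields in the Rest.li payload are the likely culprit:
    ```python
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

    # Placeholder GMS address; use the same server the script already talks to.
    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    QUERY = """
    query listFlowsAndJobs {
      searchAcrossEntities(
        input: { types: [DATA_FLOW, DATA_JOB], query: "*", start: 0, count: 100 }
      ) {
        total
        searchResults {
          entity {
            urn
            type
          }
        }
      }
    }
    """

    result = graph.execute_graphql(QUERY)
    print("total:", result["searchAcrossEntities"]["total"])
    for hit in result["searchAcrossEntities"]["searchResults"]:
        print(hit["entity"]["type"], hit["entity"]["urn"])
    ```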
  • q

    quick-pizza-8906

    01/26/2023, 5:29 PM
    Hello, after upgrading my deployment to version 0.9.6.1 (from 0.9.1), my Tableau ingestor stopped working: at the end of ingestion it produces the error
    Remote end closed connection without response
    (see attached log). I noticed that my deployment versioned 0.9.1 uses
    tableauserverclient
    version
    0.19.0
    while the newer one used
    0.23.4
    - I downgraded it on my newer deployment to
    0.19.0
    only to see the same exception... Note that my existing 0.9.1 deployment connects to the Tableau server just fine, so it's not a matter of networking or the server being down. Was there any significant change applied to the Tableau connector which could have caused this? Is anybody else seeing similar problems?
    tableau_problem.log
  • n

    nutritious-bird-77396

    01/26/2023, 5:42 PM
    Hi Team, After upgrading my datahub version from
    0.8.43
    to
    0.9.6.1
    I am facing errors with reindexing...
    Copy code
    17:30:57 [main] INFO  c.l.m.s.e.i.ESIndexBuilder - Reindexing dataset_operationaspect_v1 to dataset_operationaspect_v1_1674751305780 task has completed, will now check if reindex was successful
    17:31:00 [main] INFO  c.l.m.s.e.i.ESIndexBuilder - Post-reindex document count is different, source_doc_count: 34822915 reindex_doc_count: 15463000
    17:31:00 [main] WARN  o.s.w.c.s.XmlWebApplicationContext - Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'metadataChangeLogProcessor' defined in URL [jar:file:/tmp/jetty-0_0_0_0-8080-war_war-_-any-3785592998662924994/webapp/WEB-INF/lib/mae-consumer.jar!/com/linkedin/metadata/kafka/MetadataChangeLogProcessor.class]: Unsatisfied dependency expressed through constructor parameter 0; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'updateIndicesHook' defined in URL [jar:file:/tmp/jetty-0_0_0_0-8080-war_war-_-any-3785592998662924994/webapp/WEB-INF/lib/mae-consumer.jar!/com/linkedin/metadata/kafka/hook/UpdateIndicesHook.class]: Bean instantiation via constructor failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.linkedin.metadata.kafka.hook.UpdateIndicesHook]: Constructor threw exception; nested exception is java.lang.RuntimeException: Reindex from dataset_operationaspect_v1 to dataset_operationaspect_v1_1674751305780 failed
    17:31:00 [main] INFO  c.l.r.t.h.c.c.AbstractNettyClient - Shutdown requested
    17:31:00 [main] INFO  c.l.r.t.h.c.c.AbstractNettyClient - Shutting down
    Has anybody else faced this issue? Any tips would help.
  • a

    able-evening-90828

    01/26/2023, 11:27 PM
    The
    andFilter
    in the
    orFilters
    in
    SearchInput
    seems to require all fields of a dataset to match the `andFilter`'s condition. Otherwise, the dataset won't be returned. For example, say we have a dataset that has the following columns and tags defined
    Copy code
    col1: [tagA, tagB]
    col2: [tagA]
    If I run the GraphQL query below, the dataset is not returned, even though
    col2
    satisfies the filter condition.
    Copy code
    query searchDataset {
      search(input: {
        type: DATASET, 
        query: "", 
        start: 0, 
        count: 1000,
        orFilters: [
          {
            and: [
              {
                field: "fieldTags",
                values: ["urn:li:tag:tagA"]
                condition: CONTAIN
              }
              {
                field: "fieldTags",
                values: ["urn:li:tag:tagB"]
                condition: CONTAIN
                negated: true
              }
            ]
          }
        ]
      }) {
        start
        count
        total
        searchResults {
          entity {
            urn
            type
          }
        }
      
    }
    }
    What I want is if at least one column satisfies the tag filter condition, then the dataset should be returned. How can I achieve this?
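    Since the behaviour above suggests the search document carries field tags at the dataset level rather than per column, one hedged way to get per-column semantics is to post-filter client-side: search for datasets containing tagA at all, then keep only those with at least one column that has tagA and not tagB. A sketch (the server URL is a placeholder; note that tags added to columns through the UI may live in editableSchemaMetadata rather than schemaMetadata):
    ```python
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # placeholder

    SEARCH = """
    query datasetsWithTagA {
      search(input: { type: DATASET, query: "", start: 0, count: 1000,
                      orFilters: [{ and: [{ field: "fieldTags",
                                            values: ["urn:li:tag:tagA"],
                                            condition: CONTAIN }] }] }) {
        searchResults { entity { urn } }
      }
    }
    """

    FIELDS = """
    query fieldTags($urn: String!) {
      dataset(urn: $urn) {
        schemaMetadata {
          fields {
            fieldPath
            tags { tags { tag { urn } } }
          }
        }
      }
    }
    """

    matches = []
    for hit in graph.execute_graphql(SEARCH)["search"]["searchResults"]:
        urn = hit["entity"]["urn"]
        dataset = graph.execute_graphql(FIELDS, variables={"urn": urn}).get("dataset") or {}
        fields = (dataset.get("schemaMetadata") or {}).get("fields") or []
        for field in fields:
            tag_urns = {
                assoc["tag"]["urn"]
                for assoc in ((field.get("tags") or {}).get("tags") or [])
            }
            # Keep the dataset if any single column has tagA but not tagB.
            if "urn:li:tag:tagA" in tag_urns and "urn:li:tag:tagB" not in tag_urns:
                matches.append(urn)
                break

    print(matches)
    ```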
  • b

    bland-orange-13353

    01/27/2023, 12:57 AM
    If you’re having trouble with quickstart, please make sure you’re using the most up-to-date version of DataHub by following the steps in the quickstart deployment guide: https://datahubproject.io/docs/quickstart/#deploying-datahub. Specifically, ensure you’re up to date with the DataHub CLI:
    Copy code
    python3 -m pip install --upgrade pip wheel setuptools
    python3 -m pip install --upgrade acryl-datahub
    datahub version
  • r

    rhythmic-glass-37647

    01/27/2023, 1:28 AM
    Hi, I'm trying to set up ingestion from the CLI. I'm using a very simple YAML file but I keep getting
    PipelineInitError
    Any help would be appreciated!
  • b

    brief-ability-41819

    01/27/2023, 6:50 AM
    Hello, Is there a way of changing ClusterIP to LoadBalancer in this subchart: https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/acryl-datahub-actions/values.yaml ? I tried to apply it (of course running
    helm dep update
    before the upgrade itself) and it still shows the service as ClusterIP. I have a feeling that I’m missing something. FYI, we’re running DataHub 0.9.1 on EKS.
  • b

    best-wire-59738

    01/27/2023, 7:04 AM
    Hello Team, I noticed that datahub-frontend is not getting updated from GMS. For example, when I run ingestion from the UI I get a pop-up that the run was triggered, and in the actions logs I can see it is being ingested, but the UI is not updated with the latest run details. Also, I invited a new user using an invite link and the user does not show up in the Users tab in the UI. Could you please help debug the issue? We are running DataHub 0.9.6 on EKS.
  • a

    acceptable-terabyte-34789

    01/27/2023, 7:13 AM
    How can I delete a dataset from the CLI? I'm trying to use: datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:athena,xxx_exception,PROD)" --dry-run but it throws:
    Copy code
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 971, in json
        return complexjson.loads(self.text, **kwargs)
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py", line 346, in loads
        return _default_decoder.decode(s)
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/decoder.py", line 355, in raw_decode
        raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
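    For what it's worth, the `JSONDecodeError` means the server the CLI contacted returned a body that isn't JSON (for example an HTML error page or a proxy response). Besides the URN formatting, it may be worth checking what the configured GMS endpoint actually returns; a small sketch, assuming a default local GMS address (replace it with the server from `~/.datahubenv`):
    ```python
    import requests

    GMS = "http://localhost:8080"  # replace with the server configured in ~/.datahubenv

    resp = requests.get(f"{GMS}/config", timeout=10)
    print(resp.status_code, resp.headers.get("content-type"))
    # A healthy GMS returns a JSON body here; an HTML or empty response would
    # explain the "Expecting value: line 1 column 1 (char 0)" from the delete command.
    print(resp.text[:200])
    ```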
  • g

    gray-ocean-32209

    01/27/2023, 1:27 PM
    Hello Team, we are experimenting with the Airflow DataHub integration (Airflow lineage backend) using datahub quickstart and the datahub-airflow Docker setup: https://datahubproject.io/docs/docker/airflow/local_airflow
    Copy code
    [lineage]
    backend = datahub_provider.lineage.datahub.DatahubLineageBackend
    datahub_kwargs = {
        "datahub_conn_id": "datahub_rest_default",
        "cluster": "local_airflow",
        "capture_ownership_info": true,
        "capture_tags_info": true,
        "capture_executions": true,
        "graceful_exceptions": true }
    To see the run history of Airflow tasks in DataHub, we added
    "capture_executions": true
    Whenever we add this option and try to initialize Airflow with the command
    docker-compose up airflow-init
    it fails with
    Copy code
    ....
    datahub-airflow-airflow-init-1  |     _backend = get_backend()
    datahub-airflow-airflow-init-1  |   File "/home/airflow/.local/lib/python3.9/site-packages/airflow/lineage/__init__.py", line 61, in get_backend
    datahub-airflow-airflow-init-1  |     return clazz()
    datahub-airflow-airflow-init-1  |   File "/home/airflow/.local/lib/python3.9/site-packages/datahub_provider/lineage/datahub.py", line 64, in __init__
    datahub-airflow-airflow-init-1  |     _ = get_lineage_config()
    datahub-airflow-airflow-init-1  |   File "/home/airflow/.local/lib/python3.9/site-packages/datahub_provider/lineage/datahub.py", line 35, in get_lineage_config
    datahub-airflow-airflow-init-1  |     return DatahubLineageConfig.parse_obj(kwargs)
    datahub-airflow-airflow-init-1  |   File "pydantic/main.py", line 511, in pydantic.main.BaseModel.parse_obj
    datahub-airflow-airflow-init-1  |   File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
    datahub-airflow-airflow-init-1  | pydantic.error_wrappers.ValidationError: 1 validation error for DatahubLineageConfig
    datahub-airflow-airflow-init-1  | capture_executions
    I’m running
    acryldata/airflow-datahub:latest
    image. Is `capture_executions` not supported?