lively-jackal-83760
09/08/2022, 10:37 AM
The calling method's class, org.springframework.beans.factory.xml.XmlBeanDefinitionReader, is available from the following locations:
jar:file:/Users/dseredenko/.m2/repository/org/springframework/spring-beans/5.3.21/spring-beans-5.3.21.jar!/org/springframework/beans/factory/xml/XmlBeanDefinitionReader.class
jar:file:/Users/dseredenko/.m2/repository/io/acryl/datahub-client/0.8.44/datahub-client-0.8.44.jar!/org/springframework/beans/factory/xml/XmlBeanDefinitionReader.class
It looks like datahub-client bundles its own copy of the Spring Framework, which conflicts with mine. I don't see any Spring dependency in the datahub-client POM, and I don't understand how to exclude Spring from datahub-client.
Can someone help?
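A quick way to confirm whether the Spring classes really ship inside the datahub-client jar itself (path taken from the log above; adjust to your local ~/.m2 layout):

jar tf ~/.m2/repository/io/acryl/datahub-client/0.8.44/datahub-client-0.8.44.jar | grep 'org/springframework/' | head

If they do, a Maven <exclusion> on the datahub-client dependency won't help, since exclusions only remove transitive dependencies, not classes bundled into the jar.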
bland-orange-13353
09/08/2022, 11:41 AM
agreeable-belgium-70840
09/08/2022, 12:33 PM
astonishing-byte-5433
09/08/2022, 5:55 PM
volumes:
- ../mysql/init.sql:/docker-entrypoint-initdb.d/init.sql
It throws the following error when run for the first time on a new machine, causing the mysql container to fail:
Can't initialize batch_readline - may be the input source is a directory or a block device.
After running the compose file a second time it doesn't throw this error again and everything works fine.
I also removed the volume from the script and it seems to run fine on a fresh machine.
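One possible cause (an assumption, not confirmed in the thread): if ../mysql/init.sql does not exist on the host the first time compose runs, Docker creates that path as an empty directory, and the mysql entrypoint then fails with exactly this batch_readline message. A quick check on the fresh machine:

# if this turns out to be a directory, Docker created it; remove it and restore the real file before re-running compose
ls -ld ../mysql/init.sql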
cuddly-butcher-39945
09/08/2022, 2:42 PM
kind-lunch-14188
09/09/2022, 2:05 PM
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,project_id.dataset.table,DEV)" --aspect schemaMetadata | jq >schema.json
Then I tried to upload this schema to a new, "empty" DataHub. To achieve that I ran
datahub ingest -c ingest_schema.dhub.yaml
where ingest_schema.dhub.yaml is as follows:
# see https://datahubproject.io/docs/generated/ingestion/sources/file for complete documentation
source:
  type: "file"
  config:
    filename: schema.json
# see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
sink:
  type: "datahub-rest"
  config:
    server: "https://gms.topic-dev.name.dev"
Anyway, I'm still receiving the same error:
Command failed with com.linkedin.pegasus2avro.usage.UsageAggregation is missing required field: bucket.
In fact there is no UsageAggregation in the generated JSON file. My best guess is that I'm generating the JSON file the wrong way. Are you able to give me a hint on how to do it correctly?
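One possible explanation (an assumption, not confirmed here): the file source expects MetadataChangeEvent/MCP-shaped records, while datahub get returns a plain dictionary of aspects, so the reader tries to match the JSON against other record types such as UsageAggregation and fails. If your CLI version supports datahub put, writing the aspect straight back to the new instance may be simpler than going through the file source:

# sketch: assumes the CLI is pointed at the new instance (e.g. via datahub init or DATAHUB_GMS_URL)
# and that schema.json contains just the schemaMetadata aspect value
datahub put --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,project_id.dataset.table,DEV)" \
  --aspect schemaMetadata -d schema.json

If datahub get wrapped the aspect in an outer object keyed by aspect name, something like jq '.schemaMetadata' schema.json can extract the bare aspect first.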
brainy-tent-14503
09/09/2022, 2:41 PM
The spark-lineage smoke tests fail because of a difference in the browse path. I am working off the main branch with only a few slight fixes to get the test to run (see diff). I am wondering if the golden json files might be out of date from what is expected, or if something else is off. For example, the tests expected this:
'com.linkedin.common.BrowsePaths': {
    'paths': ['/spark/spark_spark-master_7077/pythonhdfsin2hivecreatetable']
}
however the actual output looks like this, with the /pythonhdfsin2hivecreatetable part missing:
'com.linkedin.common.BrowsePaths': {
    'paths': ['/spark/spark_spark-master_7077']
}
The missing part shows up in the properties as appName, so ... maybe I should update the expected golden json?
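If it helps to see how widespread the drift is before regenerating anything, a small jq sketch (the filename is a placeholder for whichever golden file the test compares against) that dumps every BrowsePaths value in a file:

jq '[.. | .["com.linkedin.common.BrowsePaths"]? | select(. != null) | .paths[]]' golden_file.json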
billowy-truck-48700
09/09/2022, 7:16 PM
bitter-lizard-32293
09/09/2022, 8:00 PM
We're using the searchAcrossLineage graphQL query and we've been seeing super high latencies (>10s) while executing it. We spent some time digging into things and it looks like we're spending the bulk of our time in the getLineage call in the ESGraphQueryDao class (we use ES as our graph store too). I did find one minor bug in that the search lineage results were meant to be cached, but that is actually not being done - https://github.com/datahub-project/datahub/pull/5892. This does help us fix repeated calls for the same URN, but first-time calls are still taking a while. Does anyone have any recommendations on how we could tune / speed things up here?
Ballpark-wise, our graph_service_v1 index has around 36M docs (4.8GB on disk) and is currently running 1 shard and 1 replica (I wonder if this is too low).
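On the shard/replica question, a minimal sketch, assuming you can hit the Elasticsearch cluster directly (host and replica count below are illustrative): replicas can be raised on the live index, whereas the shard count is fixed at index creation and would require a reindex or split to change.

curl -XPUT 'http://localhost:9200/graph_service_v1/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 2}}'

Extra replicas mainly spread read load across nodes; they won't speed up a single deep getLineage traversal on their own.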
great-branch-515
09/11/2022, 7:06 PM
late-rocket-94535
09/12/2022, 7:45 AM
transformers:
- type: "simple_add_dataset_ownership"
config:
owner_urns:
- "urn:li:corpGroup:edwh"
ownership_type: "TECHNICAL_OWNER"
and after the last DataHub upgrade (v0.8.44) I see errors like:
[2022-09-12, 07:08:52 UTC] {process_utils.py:173} INFO - [2022-09-12, 07:08:52 UTC] {pipeline.py:57} ERROR - failed to write record with workunit txform-urn:li:dataPlatform:postgres-edwh.raw_tn_marketing.t_customer-TEST-ownership with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /lastModified/actor :: "Provided urn " is invalid\n\n\tat com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:149)', 'message': 'Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /lastModified/actor :: "Provided urn " is invalid', 'status': 422}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /lastModified/actor :: "Provided urn " is invalid\n\n\tat com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:149)', 'message': 'Failed to validate record with class com.linkedin.common.Ownership: ERROR :: /lastModified/actor :: "Provided urn " is invalid', 'status': 422})
Is this a bug or not?
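A debugging sketch, not a fix: to see exactly what the transformer emits for lastModified.actor (the field the 422 complains about), you can temporarily point the same recipe at the file sink and inspect the Ownership aspect it writes out, for example:

sink:
  type: file
  config:
    filename: ./transformed_output.json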
sparse-quill-63288
09/12/2022, 12:27 PM
witty-journalist-16013
09/12/2022, 3:50 PM
aws_connection is an extra field.
witty-journalist-16013
09/12/2022, 3:52 PM
'1 validation error for DBTConfig\n'
'aws_connection\n'
' extra fields not permitted (type=value_error.extra)\n',
witty-journalist-16013
09/12/2022, 3:57 PM
early-airplane-84388
09/12/2022, 6:24 PM
~~~~ Execution Summary ~~~~
RUN_INGEST - {'errors': [],
'exec_id': '8cc88fff-6200-465e-9d09-703160d417fc',
'infos': ['2022-09-12 13:20:11.799891 [exec_id=8cc88fff-6200-465e-9d09-703160d417fc] INFO: Starting execution for task with name=RUN_INGEST',
'2022-09-12 13:20:11.801144 [exec_id=8cc88fff-6200-465e-9d09-703160d417fc] INFO: Caught exception EXECUTING '
'task_id=8cc88fff-6200-465e-9d09-703160d417fc, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task\n'
' self.event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
' return f.result()\n'
' File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
' raise self._exception\n'
' File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
' result = coro.send(None)\n'
' File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 71, in execute\n'
' validated_args = SubProcessIngestionTaskArgs.parse_obj(args)\n'
' File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj\n'
' File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__\n'
'pydantic.error_wrappers.ValidationError: 1 validation error for SubProcessIngestionTaskArgs\n'
'debug_mode\n'
' extra fields not permitted (type=value_error.extra)\n']}
Execution finished with errors.
I'm running DataHub on GCP Kubernetes Engine and have also tried uninstalling and reinstalling with Helm.
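A guess at the cause (not confirmed here): debug_mode is a newer argument sent along with UI ingestion runs, so an older acryl-datahub-actions executor image that predates it rejects it with exactly this pydantic "extra fields not permitted" error; aligning the actions image with the rest of the deployment may help. A hypothetical Helm values sketch (key names depend on your chart version, so check the chart's values.yaml):

acryl-datahub-actions:
  image:
    repository: acryldata/datahub-actions
    tag: v0.0.7   # illustrative tag only; match the version your chart/app expects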
thousands-solstice-2498
09/13/2022, 5:51 AM
great-branch-515
09/13/2022, 6:22 AM
fresh-cricket-75926
09/13/2022, 8:31 AM
lively-jackal-83760
09/13/2022, 11:43 AM
hallowed-dog-79615
09/13/2022, 2:57 PM
great-branch-515
09/02/2022, 10:12 AM
kind-whale-32412
09/13/2022, 11:42 PM
(Result window is too large) for paths that have more than 10,000 assets. I created an issue explaining it in detail:
https://github.com/datahub-project/datahub/issues/5928
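While the issue is open, a possible stopgap (a sketch, assuming direct Elasticsearch access; the index name and limit below are illustrative): the message comes from Elasticsearch's index.max_result_window setting, which defaults to 10,000 and can be raised per index, at the cost of more memory per deep query.

curl -XPUT 'http://localhost:9200/datasetindex_v2/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"max_result_window": 50000}}'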
famous-florist-7218
09/14/2022, 8:11 AM
log4j.logger.datahub.spark=DEBUG
log4j.logger.datahub.client.rest=DEBUG
I've deployed a Spark k8s operator with the latest datahub-spark-lineage jar (0.8.44-2). It has failed to emit metadata to GMS. The error is: Application end event received, but start event missing for appId spark-xxx. That's why I need to enable log4j debug mode.
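One common way to get those two log4j lines picked up by the driver and executors (a sketch, assuming a log4j 1.x properties file shipped alongside the job; the file path is a placeholder): ship the file with --files and point log4j at it via the extra Java options. With the Spark operator, the same spark.* keys can go under sparkConf in the SparkApplication spec.

spark-submit \
  --files /path/to/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  ...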
purple-balloon-66501
09/14/2022, 8:53 AM
limited-forest-73733
09/14/2022, 10:46 AM
witty-lamp-55264
09/14/2022, 12:46 PM
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1"
After searching, I found that the latest Kubernetes versions no longer serve some API versions, such as policy/v1beta1 for PodDisruptionBudget. I want to know whether there will be a newer version that works with the latest Kubernetes versions, or whether there is another solution.
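A quick way to confirm what your cluster actually serves for PodDisruptionBudget (policy/v1 has been available since Kubernetes 1.21, and policy/v1beta1 was removed in 1.25), before looking for a chart release that renders policy/v1:

kubectl api-versions | grep '^policy/'
kubectl explain poddisruptionbudget --api-version=policy/v1 | head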
witty-lamp-55264
09/14/2022, 12:46 PM
salmon-angle-92685
09/14/2022, 12:51 PM
source:
  type: snowflake
  config:
    account_id: ${DH_SNOWFLAKE_ACCOUNT_ID}
    warehouse: ${DH_SNOWFLAKE_WAREHOUSE}
    username: ${DH_SNOWFLAKE_USER}
    password: ${DH_SNOWFLAKE_PASSWORD}
    role: ${DH_SNOWFLAKE_ROLE}
    include_tables: True
    include_views: True
    ignore_start_time_lineage: true
    stateful_ingestion:
      enabled: True
      remove_stale_metadata: True
    profiling:
      enabled: true
    profile_pattern:
      allow:
        - 'DATABASE_NAME.SCHEMA_NAME.*'
    database_pattern:
      allow:
        - "OIP"
    schema_pattern:
      allow:
        - "SCHEMA_NAME"
    table_pattern:
      deny:
        - '.*\._AIRBYTE_.*'
pipeline_name: "snowflake_ingestion"
sink:
  type: datahub-rest
  config:
    server: ${DATAHUB_SERVER}
Thank you guys in advance!
bright-diamond-60933
09/14/2022, 4:02 PM
mlfeatureindex_v2 is duplicated with different timestamps appended to the name. Sometimes we also notice that the pod doesn't even start up when the standalone consumers flag is enabled: