# troubleshoot
  • n

    numerous-address-22061

    08/31/2023, 6:40 PM
    this doesn't look right
    ✅ 1
  • a

    able-library-93578

    08/31/2023, 11:21 PM
    Hi @witty-plumber-82249, not quite sure where to post this, but I will put it here. I have DataHub deployed on K8s, all working, no issues there. I have ingested some metadata using the UI for PowerBI, and with the ingestion I got some owners attached to the assets, great! I then configured DataHub for SSO with OIDC via Okta and got that working; it was pretty easy. Now here is the issue: there are two entries for the same user (exactly the same email address). One entry has the assets attached to it, the other is the SSO profile. It seems the SSO and ingested users did not reconcile. Should I delete all PBI metadata and bring it in fresh?
  • f

    fierce-doctor-85079

    09/01/2023, 7:15 AM
    Hello everyone, I would like to know how to import a multi-level business glossary, for example one with two term groups nested at the second level.
  • f

    fierce-doctor-85079

    09/01/2023, 7:16 AM
    image.png
  • f

    fierce-doctor-85079

    09/01/2023, 7:17 AM
    Only the last term group is recognized after importing the file above. Do you know how to fix this? Thanks.
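    In case it helps, here is a minimal sketch of a glossary file with term groups nested at a second level (all group and term names below are made up): each entry under nodes can itself contain a nodes list, which is how multi-level term groups are expressed.
    version: 1
    source: DataHub
    owners:
      users:
        - datahub
    nodes:
      - name: Data Classification        # first-level term group
        description: Top-level grouping of classification terms
        nodes:
          - name: Personal Data          # second-level term group 1
            description: Terms describing personal data
            terms:
              - name: Email
                description: Email address of a user
          - name: Financial Data         # second-level term group 2
            description: Terms describing financial data
            terms:
              - name: Revenue
                description: Reported revenue figures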
  • b

    bland-orange-13353

    09/01/2023, 7:50 AM
    This message was deleted.
  • l

    late-addition-48515

    09/01/2023, 8:27 AM
    Hey everyone, I am trying to post dataset lineage using the REST emitter, but when I set two or more upstream URNs, only lineage to the last upstream in the list is posted. Any ideas why?
    Copy code
    def _post_lineage(self, parents, child):
        # Build a lineage MCE with two upstreams and one downstream
        lineage_mce = builder.make_lineage_mce(
            [
                builder.make_dataset_urn("sbx-ml", "dataset_1"),
                builder.make_dataset_urn("sbx-ml", "dataset_2"),
            ],  # upstreams
            builder.make_dataset_urn("sbx-ml", "dataset_3"),  # downstream
        )
        # Create an emitter to the GMS REST API.
        emitter = DatahubRestEmitter("http://34:8080")
        # Emit metadata!
        emitter.emit_mce(lineage_mce)
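    An alternative sketch, assuming it is acceptable to emit the upstreamLineage aspect directly: build the full upstream list with UpstreamClass objects and send it via MetadataChangeProposalWrapper (the GMS address below is a placeholder).
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    # One UpstreamClass entry per upstream dataset
    upstreams = [
        UpstreamClass(
            dataset=builder.make_dataset_urn("sbx-ml", name),
            type=DatasetLineageTypeClass.TRANSFORMED,
        )
        for name in ["dataset_1", "dataset_2"]
    ]
    downstream = builder.make_dataset_urn("sbx-ml", "dataset_3")

    # Attach the whole upstream list to the downstream dataset in a single aspect
    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=downstream,
            aspect=UpstreamLineageClass(upstreams=upstreams),
        )
    )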
  • p

    purple-refrigerator-27989

    09/04/2023, 6:12 AM
    Hello everyone, I want to debug the metadata-ingestion code to understand how MySQL ingestion works, but I can't find the corresponding main program; I only found mysql.py. Can anyone help?
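    For reference, the CLI entry point is src/datahub/entrypoints.py, and a debugger-friendly way to step into the MySQL source is to drive the pipeline from a small script like this sketch (connection details are placeholders):
    from datahub.ingestion.run.pipeline import Pipeline

    # Equivalent to `datahub ingest -c recipe.yaml`, but callable from an IDE debugger
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",  # placeholder
                    "database": "mydb",             # placeholder
                    "username": "user",
                    "password": "pass",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # placeholder GMS
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()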
  • p

    purple-refrigerator-27989

    09/04/2023, 7:35 AM
    Hi everyone, I ran into some problems while looking at the source code and found that the from ... import statements do not resolve the "datahub" module.
  • f

    future-yak-13169

    09/05/2023, 2:18 AM
    Hi guys - requesting assistance from anyone who can help figure out why we are consistently having Elasticsearch indexing problems. The DataHub version doesn't matter; we have been having this problem since v0.10 itself, and we are now on 0.10.5. We have a total of about 50k datasets spread out over multiple platforms. We have deployed DataHub using helm charts, but the MySQL backend DB is outside the Kubernetes cluster; it's a managed service on-premise. Elasticsearch is running within the k8s cluster with 3 replicas. Whenever we ingest new data, it doesn't show up in the UI, but it is accessible via direct URL. We try running the restore-indices job and it completes without errors, but there is no change in the UI. There is no resource problem that I can see within the k8s cluster; Elasticsearch and GMS have sufficient resources to work with. Elasticsearch is set up as both master and node according to https://github.com/acryldata/datahub-helm/blob/master/charts/prerequisites/values.yaml
    datahub-frontend:
      image:
        repository:
      imagePullSecrets:
        - name:
      resources:
        requests:
          memory: 1Gi
          cpu: 500m
        limits:
          memory: 1Gi
          cpu: 500m
    datahub-gms:
      image:
        repository:
      imagePullSecrets:
        - name:
      resources:
        requests:
          cpu: 1000m
          memory: 2Gi
        limits:
          cpu: 1000m
          memory: 4Gi
      livenessProbe:
        initialDelaySeconds: 120
      readinessProbe:
        initialDelaySeconds: 120
      extraEnvs:
        - name: DATAHUB_TELEMETRY_ENABLED
          value: "false"
        - name: EBEAN_MAX_CONNECTIONS
          value: "400"
        - name: EBEAN_WAIT_TIMEOUT_MILLIS
          value: "9000"
    elasticsearchSetupJob:
      image:
        repository:
      resources:
        limits:
          cpu: 250m
          memory: 512Mi
        requests:
          cpu: 250m
          memory: 512Mi
    kafkaSetupJob:
      image:
        repository:
      resources:
        limits:
          cpu: 1000m
          memory: 1024Mi
        requests:
          cpu: 1000m
          memory: 1024Mi
    datahubUpgrade:
      enabled: true
      image:
        repository:
      imagePullSecrets:
        - name:
      resources:
        limits:
          cpu: 250m
          memory: 256Mi
        requests:
          cpu: 250m
          memory: 256Mi
      restoreIndices:
        resources:
          limits:
            cpu: 800m
            memory: 3Gi
          requests:
            cpu: 500m
            memory: 2Gi
      esJavaOpts: "-Xmx2048m -Xms2048m"
    datahubSystemUpdate:
      image:
        repository:
      podSecurityContext: {}
      securityContext: {}
      podAnnotations: {}
      resources:
        limits:
          cpu: 2000m
          memory: 2048Mi
        requests:
          cpu: 1000m
          memory: 1024Mi
    global:
      graph_service_impl: elasticsearch
      sql:
        datasource:
          host:
          hostForMysqlClient:
          url:
          username:
          password:
            secretRef: mysql-secrets
            secretKey: mysql-root-password
      kafka:
        schemaregistry:
          url: "http://prerequisites-cp-schema-registry:8081"
          type: KAFKA
      datahub:
        version: v0.10.4
        metadata_service_authentication:
          enabled: true
    -------------------------------------------------------------------
    elasticsearch:
      image:
      imagePullSecrets:
        - name:
      sysInitContainer:
        enabled: false
      sysctlInitContainer:
        enabled: false
      esJavaOpts: "-Xmx2048m -Xms2048m"
      replicas: 3
      resources:
        requests:
          cpu: 100m
          memory: 2Gi
        limits:
          cpu: 200m
          memory: 4Gi
      livenessProbe:
        initialDelaySeconds: 120
      readinessProbe:
        initialDelaySeconds: 120
    kafka:
      global:
        imageRegistry:
        imagePullSecrets:
          -
      image:
        registry:
        repository: bitnami/kafka
        pullSecrets:
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 1000m
          memory: 1Gi
      livenessProbe:
        initialDelaySeconds: 120
      readinessProbe:
        initialDelaySeconds: 120
      persistence:
        enabled: true
        storageClass: "nas"
        accessModes:
          - ReadWriteOnce
        size: 200Gi
  • b

    busy-analyst-35820

    09/05/2023, 4:59 AM
    Hi Team, we face the error below, "500 Unknown error", when we perform a partial text search. We couldn't find anything specific in the Elasticsearch log. Can you please help us here?
  • f

    fierce-doctor-85079

    09/05/2023, 5:17 AM
    Hello everyone, I would like to ask how to batch-modify the business glossary through YAML files.
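    One approach, assuming the datahub-business-glossary file source: keep the glossary in a single YAML file and re-run an ingestion recipe like this sketch with datahub ingest -c glossary_recipe.yaml whenever the file changes (file path and server address are placeholders).
    # glossary_recipe.yaml (hypothetical file names and paths)
    source:
      type: datahub-business-glossary
      config:
        file: ./business_glossary.yml
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080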
  • f

    fierce-doctor-85079

    09/05/2023, 6:02 AM
    0.10.5
  • b

    bland-orange-95847

    09/05/2023, 6:11 AM
    Hi we are facing authorization issues with
    METADATA_SERVICE_AUTH
    enabled and group ownership policies. It's hard to describe, but if you have some ideas in that area, please have a look at this GitHub issue: https://github.com/datahub-project/datahub/issues/8781 - appreciate any help. To me it looks like too much data is fetched at some point and one indirection is not resolved correctly, but maybe I am missing something 🙂
  • f

    future-yak-13169

    09/05/2023, 9:51 AM
    We keep getting these I/O reactor STOPPED errors in the GMS backend, and it redeploys itself.
    Caused by: java.lang.RuntimeException: Request cannot be executed; I/O reactor status: STOPPED
    at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:887)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:283)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:270)
    at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1632)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1602)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1572)
    at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1088)
    at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:87)
    ... 13 common frames omitted
    Caused by: java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED
    at org.apache.http.util.Asserts.check(Asserts.java:46)
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.ensureRunning(CloseableHttpAsyncClientBase.java:90)
    at org.apache.http.impl.nio.client.InternalHttpAsyncClient.execute(InternalHttpAsyncClient.java:123)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:279)
    ... 19 common frames omitted
    Any advice on why this could be happening? Something to do with Elasticsearch?
  • q

    quick-pizza-8906

    09/05/2023, 11:19 AM
    Hello, I have a question related to some things introduced by the
    0.10.x
    version of DataHub. I recently got bitten by the
    searchAcrossEntities
    facets (buckets) count being limited to 20. I found 2 settings related to this: 1. the environment variable
    ELASTICSEARCH_QUERY_MAX_TERM_BUCKET_SIZE
    2. the
    searchFlags.maxAggValues
    input parameter for the query. What I have noticed is that changing the value of
    ELASTICSEARCH_QUERY_MAX_TERM_BUCKET_SIZE
    does not change the actual limit on the returned bucket count, while changing
    searchFlags.maxAggValues
    (at least in version
    0.10.5
    ) actually does change the bucket count limit. I am a bit confused: what is the intended relation between the env variable and the query input parameter? This puzzles me especially considering that the query builder https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/c[…]ta/search/elasticsearch/query/request/SearchRequestHandler.java does not use
    finalSearchFlags
    when building aggregations, and does not seem to use the value coming from
    ELASTICSEARCH_QUERY_MAX_TERM_BUCKET_SIZE
    . What am I missing here?
  • m

    mysterious-advantage-78411

    09/05/2023, 1:19 PM
    Hi guys! Could somebody share the database resources (CPU, disk, and RAM) required for your DataHub instance, along with the total number of datasets in your DataHub as a reference metric? We regularly receive "Unable to emit metadata to Data Hub GMS: com.datahub.util.exception.RetryLimitReached: Failed to add after 4 retries", even though we have the latest version of the actions module, and we don't understand why it still happens. Today we tried to reduce these incidents by staggering the ingestions so they don't run at the same time. It still looks like a scalability issue: we plan to add more than 500 ingestions, roughly 10x growth from today, after which staggering ingestions by runtime will be difficult. Any ideas?
    👀 1
  • c

    chilly-potato-57465

    09/05/2023, 1:38 PM
    Hello! I am wondering how to implement the following. I would like to give users access equivalent to the Reader role to everything except datasets marked with a certain domain. For instance, I create a domain called Sensitive and annotate the relevant datasets with it. Then the Reader role should be able to see all datasets except those marked with Sensitive. As far as I understand, I have to implement a policy which explicitly gives access to everything else but those datasets. That is, I can't create a policy that excludes a certain domain; rather, I have to create a policy that includes everything except it. Is this understanding correct? Thank you!
  • b

    big-nightfall-99541

    09/05/2023, 1:55 PM
    Hi everyone!! I'm trying to emit a custom lineage between an MlModelGroup and an MlFeatureTable, but I'm facing some problems when executing the script:
    'Unable to emit metadata to DataHub GMS: java.lang.RuntimeException: Unknown aspect upstreamLineage for entity mlmodelgroup'
    What am I doing wrong? [The full script and traceback are in the thread.] Thank you!
  • g

    gentle-gold-63488

    09/05/2023, 3:53 PM
    Hi! How are you there? I have a question: how can I add a primary key and a foreign key to an S3 dataset?
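    One possible approach, sketched below, is to emit a schemaMetadata aspect that declares primaryKeys and foreignKeys for the dataset; the bucket, field names, and server address are hypothetical. Note that this writes the whole schemaMetadata aspect, so it is best built on top of the fields the S3 ingestion already produced.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ForeignKeyConstraintClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    dataset_urn = builder.make_dataset_urn("s3", "my-bucket/orders", "PROD")
    referenced_urn = builder.make_dataset_urn("s3", "my-bucket/customers", "PROD")

    schema = SchemaMetadataClass(
        schemaName="orders",
        platform=builder.make_data_platform_urn("s3"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="order_id",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="string",
            ),
            SchemaFieldClass(
                fieldPath="customer_id",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="string",
            ),
        ],
        # Declare the primary key and a foreign key pointing at another S3 dataset
        primaryKeys=["order_id"],
        foreignKeys=[
            ForeignKeyConstraintClass(
                name="customer_fk",
                foreignFields=[builder.make_schema_field_urn(referenced_urn, "customer_id")],
                sourceFields=[builder.make_schema_field_urn(dataset_urn, "customer_id")],
                foreignDataset=referenced_urn,
            )
        ],
    )

    DatahubRestEmitter("http://localhost:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema)
    )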
  • g

    gentle-gold-63488

    09/05/2023, 3:53 PM
    image.png
  • c

    colossal-football-58924

    09/05/2023, 5:51 PM
    Hello, I am using the quickstart guide to run DataHub in Docker. I get a series of error messages when I execute datahub docker quickstart. The first error is: "Unable to connect to GitHub, using default quickstart version mapping config". Can someone please assist with this error? Thank you in advance.
  • a

    able-library-93578

    09/05/2023, 7:28 PM
    Hello @witty-plumber-82249, I had a deployed, working version of DataHub v0.10.5.5. After tinkering with several config settings for SSO and GMS authentication, I decided to get it back to the normal state as deployed by my helm charts. I am now getting issues re-deploying the prerequisites, more specifically the mysql startup probe; I get the error below. I am guessing it has something to do with the PVC, but I'm not sure. I don't want to nuke everything, as I have things I do not want to lose. Any guidance will be appreciated.
  • b

    broad-grass-53166

    09/05/2023, 9:24 PM
    Hi, I am trying to run and debug metadata-ingestion locally but am unable to, due to the error below.
    Traceback (most recent call last):
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/entrypoints.py", line 10, in <module>
    from datahub.cli.check_cli import check
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/cli/check_cli.py", line 13, in <module>
    from datahub.ingestion.run.pipeline import Pipeline
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/ingestion/run/pipeline.py", line 29, in <module>
    from datahub.ingestion.extractor.extractor_registry import extractor_registry
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/ingestion/extractor/extractor_registry.py", line 1, in <module>
    from datahub.ingestion.api.registry import PluginRegistry
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/ingestion/api/registry.py", line 18, in <module>
    import entrypoints
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/entrypoints.py", line 10, in <module>
    from datahub.cli.check_cli import check
    ImportError: cannot import name 'check' from partially initialized module 'datahub.cli.check_cli' (most likely due to a circular import) (/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/cli/check_cli.py)
    I have already tried following the steps noted in the documentation below: https://datahubproject.io/docs/metadata-ingestion/developing/#requirements I am at the point where I can build the code but am unable to run it. Ideally, I would like to run this in IntelliJ. Could you please help resolve the above issue? CC: @hundreds-photographer-13496
  • c

    clever-dinner-20353

    09/06/2023, 4:02 AM
    Hello, I'm currently facing an issue where
    inlets
    are not showing up in DataHub. Here is the code:
    Copy code
    task1 = BashOperator(
            task_id="run_data_task",
            dag=dag,
            bash_command="echo 'This is where you might run your data tooling.'",
            inlets=[
                Dataset(platform="snowflake", name="mydb.schema.tableA"),
                Dataset(platform="snowflake", name="mydb.schema.tableB", env="DEV"),
                Dataset(
                    platform="snowflake",
                    name="mydb.schema.tableC",
                    platform_instance="cloud",
                ),
                # You can also put dataset URNs in the inlets/outlets lists.
                Urn(
                    "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.tableC,PROD)"
                ),
            ],
            outlets=[Dataset("snowflake", "mydb.schema.tableD")],
        )
    and here is the lineage. It should show all the previous Snowflake datasets.
    ✅ 1
  • a

    able-library-93578

    09/06/2023, 10:21 PM
    Hi All, I have followed the steps for the datahub-actions "hello_world" example. I have a DataHub deployment in AKS through helm charts, so all of the default naming is still intact, nothing custom. I have OIDC active, and
    METADATA_SERVICE_AUTH_ENABLED
    is active as well. Below is my YAML for the action:
    Copy code
    # hello_world.yaml
    name: "hello_world"
    source:
      type: "kafka"
      config:
        connection:
          bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-prerequisites-kafka:9092}
          schema_registry_url: ${SCHEMA_REGISTRY_URL:-<http://prerequisites-cp-schema-registry:8081>}
    filter:
      event_type: "EntityChangeEvent_v1"
      event:
        category: "TAG"
        operation: [ "ADD", "REMOVE" ]
        modifier: "urn:li:tag:SourcesSDP"
    action:
      type: "hello_world"
    datahub:
      server: "<https://my-datahub-domain.com/api/gms>"
      token: "my-token"
    Here is my logs from the cli:
    Copy code
    datahub actions -c hello_world.yaml                                           
    [2023-09-06 15:15:33,421] INFO     {datahub_actions.cli.actions:76} - DataHub Actions version: 0.0.13
    [2023-09-06 15:15:34,298] INFO     {datahub_actions.cli.actions:119} - Action Pipeline with name 'hello_world' is now running.
    %3|1694038534.460|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 179ms in state CONNECT)
    %3|1694038536.289|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 1 identical error(s) suppressed)
    %3|1694038567.357|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 16 identical error(s) suppressed)
    %3|1694038597.424|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038627.493|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038657.561|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038687.640|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038717.702|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038747.775|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038778.840|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 16 identical error(s) suppressed)
    %3|1694038809.898|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 16 identical error(s) suppressed)
    ^C[2023-09-06 15:20:31,393] INFO     {datahub_actions.cli.actions:137} - Stopping all running Action Pipelines...
    [2023-09-06 15:20:32,803] INFO     {datahub_actions.plugin.source.kafka.kafka_event_source:178} - Kafka consumer exiting main loop
    [2023-09-06 15:20:32,804] INFO     {datahub_actions.pipeline.pipeline_manager:81} - Actions Pipeline with name 'hello_world' has been stopped.
    
    Pipeline Report for hello_world
    
    Started at: 2023-09-06 15:15:34.297000 (Local Time)
    Duration: 298.508s
    
    Pipeline statistics
    
    {
        "started_at": 1694038534297
    }
    
    Action statistics
    
    {}
    Any advice on what to tweak is greatly appreciated.
  • b

    best-laptop-39921

    09/07/2023, 2:00 AM
    Hello, I've recently upgraded DataHub from version 0.10.2 to 0.10.5. However, I've encountered an issue when attempting advanced queries in the web UI, such as (
    \q fieldPaths: column_name
    ), as it doesn't work. Only
    \q name:
    works. Any advice would be greatly appreciated. Thank you. :) (I used the helm chart - are there any settings needed for advanced queries?)
  • q

    quiet-arm-91745

    09/07/2023, 8:09 AM
    Does the datahub helm chart expose an annotations block? I want to add this annotation to be able to use the GCS FUSE CSI driver:
    Copy code
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"
    Otherwise I can't mount a GCS bucket as a volume. Thanks in advance.
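    If the chart's podAnnotations value is what is needed here (it appears as a pod-level key in the values pasted earlier in this channel), a minimal sketch of an override could look like the following; which components actually need the annotation is an assumption.
    # values.yaml override (hypothetical placement - adjust per component)
    datahub-gms:
      podAnnotations:
        gke-gcsfuse/volumes: "true"
    datahub-frontend:
      podAnnotations:
        gke-gcsfuse/volumes: "true"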
  • b

    bitter-florist-92385

    09/07/2023, 8:31 AM
    Hey there, I am currently trying to install the Python SDK. I installed the datahub package via pip, but when I try to import modules like:
    Copy code
    from datahub import DataHubClient, MetadataChangeEvent
    I get an ImportError. Is the package not complete, or am I missing something else?
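    In the acryl-datahub package the commonly used entry points live in submodules rather than at the top level of the datahub module. A minimal sketch of imports and an emit call that should work (the server address and dataset name are placeholders):
    # pip install acryl-datahub
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    # Emit a simple aspect to verify the SDK is installed and can reach GMS
    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
    dataset_urn = builder.make_dataset_urn("hive", "example_db.example_table", "PROD")
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=DatasetPropertiesClass(description="created via the Python SDK"),
        )
    )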
  • m

    mysterious-advantage-78411

    09/07/2023, 9:34 AM
    Hi guys, is there a way to increase the S3 timeout in ingestion to avoid this error: botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: .....? Some buckets cannot be scanned due to this error.