# troubleshoot
  • n

    numerous-address-22061

    08/31/2023, 6:40 PM
    this doesn't look right
    ✅ 1
  • a

    able-library-93578

    08/31/2023, 11:21 PM
    Hi @witty-plumber-82249, not quite sure where to post this, but I will put it here. I have DataHub deployed on K8s, all working, no issues there. I have ingested some metadata using the UI for PowerBI, and with the ingestion I got some owners attached to the assets, great! I then configured DataHub for SSO with OIDC via Okta and got that working; it was pretty easy. Now here is the issue: there are two entries for the same user (exactly the same email address). One entry has the assets attached to it, the other is the SSO profile. It seems the SSO and ingested users did not reconcile. Should I delete all PBI metadata and bring it in fresh?
  • f

    fierce-doctor-85079

    09/01/2023, 7:15 AM
    Hello everyone, I would like to know how to import a multi-level business glossary, for example one with two term groups nested at the second level.
  • f

    fierce-doctor-85079

    09/01/2023, 7:16 AM
    image.png
  • f

    fierce-doctor-85079

    09/01/2023, 7:17 AM
    Only the last term group is recognized after importing the file above. Do you know how to fix this? Thanks.
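    In case it helps, here is a minimal sketch of a glossary file with term groups nested at a second level (all group and term names below are made up): each entry under nodes can itself contain a nodes list, which is how multi-level term groups are expressed.
    version: 1
    source: DataHub
    owners:
      users:
        - datahub
    nodes:
      - name: Data Classification        # first-level term group
        description: Top-level grouping of classification terms
        nodes:
          - name: Personal Data          # second-level term group 1
            description: Terms describing personal data
            terms:
              - name: Email
                description: Email address of a user
          - name: Financial Data         # second-level term group 2
            description: Terms describing financial data
            terms:
              - name: Revenue
                description: Reported revenue figures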
  • b

    bland-orange-13353

    09/01/2023, 7:50 AM
    This message was deleted.
  • l

    late-addition-48515

    09/01/2023, 8:27 AM
    Hey everyone, I am trying to post dataset lineage using the REST emitter, but when I set two or more upstream URNs, only lineage to the last upstream in the list is posted. Any ideas why?
    Copy code
    def _post_lineage(self, parents, child):
        # Build a lineage MCE with two upstreams and one downstream
        lineage_mce = builder.make_lineage_mce(
            [
                builder.make_dataset_urn("sbx-ml", "dataset_1"),
                builder.make_dataset_urn("sbx-ml", "dataset_2"),
            ],  # upstreams
            builder.make_dataset_urn("sbx-ml", "dataset_3"),  # downstream
        )
        # Create an emitter to the GMS REST API.
        emitter = DatahubRestEmitter("http://34:8080")
        # Emit metadata!
        emitter.emit_mce(lineage_mce)
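    An alternative sketch, assuming it is acceptable to emit the upstreamLineage aspect directly: build the full upstream list with UpstreamClass objects and send it via MetadataChangeProposalWrapper (the GMS address below is a placeholder).
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    # One UpstreamClass entry per upstream dataset
    upstreams = [
        UpstreamClass(
            dataset=builder.make_dataset_urn("sbx-ml", name),
            type=DatasetLineageTypeClass.TRANSFORMED,
        )
        for name in ["dataset_1", "dataset_2"]
    ]
    downstream = builder.make_dataset_urn("sbx-ml", "dataset_3")

    # Attach the whole upstream list to the downstream dataset in a single aspect
    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=downstream,
            aspect=UpstreamLineageClass(upstreams=upstreams),
        )
    )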
  • p

    purple-refrigerator-27989

    09/04/2023, 6:12 AM
    Hello everyone, I want to debug the metadata-ingestion code to understand how MySQL ingestion works, but I can't find the corresponding main program; I only found mysql.py. Can anyone help?
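    For reference, the CLI entry point is src/datahub/entrypoints.py, and a debugger-friendly way to step into the MySQL source is to drive the pipeline from a small script like this sketch (connection details are placeholders):
    from datahub.ingestion.run.pipeline import Pipeline

    # Equivalent to `datahub ingest -c recipe.yaml`, but callable from an IDE debugger
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",  # placeholder
                    "database": "mydb",             # placeholder
                    "username": "user",
                    "password": "pass",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},  # placeholder GMS
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()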
  • p

    purple-refrigerator-27989

    09/04/2023, 7:35 AM
    Hi everyone, I ran into some problems while looking at the source code and found that the from ... import statements do not resolve the "datahub" module.
  • f

    future-yak-13169

    09/05/2023, 2:18 AM
    Hi guys - requesting assistance from anyone who can help figure out why we are consistently having Elasticsearch indexing problems. The DataHub version doesn't matter; we have been having this problem since v0.10 itself, and we are now on 0.10.5. We have a total of about 50k datasets spread out over multiple platforms. We have deployed DataHub using helm charts, but the MySQL backend DB is outside the Kubernetes cluster; it's a managed service on-premise. Elasticsearch is running within the k8s cluster with 3 replicas. Whenever we ingest new data, it doesn't show up in the UI, but it is accessible via direct URL. We try running the restore-indices job and it completes without errors, but there is no change in the UI. There is no resource problem that I can see within the k8s cluster; Elasticsearch and GMS have sufficient resources to work with. Elasticsearch is set up as both master and node according to https://github.com/acryldata/datahub-helm/blob/master/charts/prerequisites/values.yaml
    datahub-frontend:
      image:
        repository:
      imagePullSecrets:
        - name:
      resources:
        requests:
          memory: 1Gi
          cpu: 500m
        limits:
          memory: 1Gi
          cpu: 500m
    datahub-gms:
      image:
        repository:
      imagePullSecrets:
        - name:
      resources:
        requests:
          cpu: 1000m
          memory: 2Gi
        limits:
          cpu: 1000m
          memory: 4Gi
      livenessProbe:
        initialDelaySeconds: 120
      readinessProbe:
        initialDelaySeconds: 120
      extraEnvs:
        - name: DATAHUB_TELEMETRY_ENABLED
          value: "false"
        - name: EBEAN_MAX_CONNECTIONS
          value: "400"
        - name: EBEAN_WAIT_TIMEOUT_MILLIS
          value: "9000"
    elasticsearchSetupJob:
      image:
        repository:
      resources:
        limits:
          cpu: 250m
          memory: 512Mi
        requests:
          cpu: 250m
          memory: 512Mi
    kafkaSetupJob:
      image:
        repository:
      resources:
        limits:
          cpu: 1000m
          memory: 1024Mi
        requests:
          cpu: 1000m
          memory: 1024Mi
    datahubUpgrade:
      enabled: true
      image:
        repository:
      imagePullSecrets:
        - name:
      resources:
        limits:
          cpu: 250m
          memory: 256Mi
        requests:
          cpu: 250m
          memory: 256Mi
      restoreIndices:
        resources:
          limits:
            cpu: 800m
            memory: 3Gi
          requests:
            cpu: 500m
            memory: 2Gi
      esJavaOpts: "-Xmx2048m -Xms2048m"
    datahubSystemUpdate:
      image:
        repository:
      podSecurityContext: {}
      securityContext: {}
      podAnnotations: {}
      resources:
        limits:
          cpu: 2000m
          memory: 2048Mi
        requests:
          cpu: 1000m
          memory: 1024Mi
    global:
      graph_service_impl: elasticsearch
      sql:
        datasource:
          host:
          hostForMysqlClient:
          url:
          username:
          password:
            secretRef: mysql-secrets
            secretKey: mysql-root-password
      kafka:
        schemaregistry:
          url: "http://prerequisites-cp-schema-registry:8081"
          type: KAFKA
      datahub:
        version: v0.10.4
        metadata_service_authentication:
          enabled: true
    -------------------------------------------------------------------
    elasticsearch:
      image:
      imagePullSecrets:
        - name:
      sysInitContainer:
        enabled: false
      sysctlInitContainer:
        enabled: false
      esJavaOpts: "-Xmx2048m -Xms2048m"
      replicas: 3
      resources:
        requests:
          cpu: 100m
          memory: 2Gi
        limits:
          cpu: 200m
          memory: 4Gi
      livenessProbe:
        initialDelaySeconds: 120
      readinessProbe:
        initialDelaySeconds: 120
    kafka:
      global:
        imageRegistry:
        imagePullSecrets:
          -
      image:
        registry:
        repository: bitnami/kafka
        pullSecrets:
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 1000m
          memory: 1Gi
      livenessProbe:
        initialDelaySeconds: 120
      readinessProbe:
        initialDelaySeconds: 120
      persistence:
        enabled: true
        storageClass: "nas"
        accessModes:
          - ReadWriteOnce
        size: 200Gi
  • b

    busy-analyst-35820

    09/05/2023, 4:59 AM
    Hi Team, we face the error below, "500 Unknown error", when we perform a partial text search. We couldn't find anything specific in the Elasticsearch log. Can you please help us here?
  • f

    fierce-doctor-85079

    09/05/2023, 5:17 AM
    Hello everyone, I would like to ask how to batch-modify the business glossary through YAML files.
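    One approach, assuming the datahub-business-glossary file source: keep the glossary in a single YAML file and re-run an ingestion recipe like this sketch with datahub ingest -c glossary_recipe.yaml whenever the file changes (file path and server address are placeholders).
    # glossary_recipe.yaml (hypothetical file names and paths)
    source:
      type: datahub-business-glossary
      config:
        file: ./business_glossary.yml
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080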
  • f

    fierce-doctor-85079

    09/05/2023, 6:02 AM
    0.10.5
  • b

    bland-orange-95847

    09/05/2023, 6:11 AM
    Hi we are facing authorization issues with
    METADATA_SERVICE_AUTH
    enabled and group ownership policies. It's hard to describe, but if you have some ideas in that area, please have a look at this GitHub issue: https://github.com/datahub-project/datahub/issues/8781 - appreciate any help. To me it looks like too much data is fetched at some point and one indirection is not resolved correctly, but maybe I am missing something 🙂
  • f

    future-yak-13169

    09/05/2023, 9:51 AM
    We keep getting these I/O reactor STOPPED errors in the GMS backend, and it redeploys itself.
    Caused by: java.lang.RuntimeException: Request cannot be executed; I/O reactor status: STOPPED
    at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:887)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:283)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:270)
    at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1632)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1602)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1572)
    at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1088)
    at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:87)
    ... 13 common frames omitted
    Caused by: java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED
    at org.apache.http.util.Asserts.check(Asserts.java:46)
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.ensureRunning(CloseableHttpAsyncClientBase.java:90)
    at org.apache.http.impl.nio.client.InternalHttpAsyncClient.execute(InternalHttpAsyncClient.java:123)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:279)
    ... 19 common frames omitted
    Any advice on why this could be happening? Something to do with Elasticsearch?
  • q

    quick-pizza-8906

    09/05/2023, 11:19 AM
    Hello, I have a question related to some things introduced by the
    0.10.x
    version of DataHub. I recently got bitten by the
    searchAcrossEntities
    facets (buckets) count being limited to 20. I found 2 settings related to this: 1. the environment variable
    ELASTICSEARCH_QUERY_MAX_TERM_BUCKET_SIZE
    2. the
    searchFlags.maxAggValues
    input parameter for the query. What I have noticed is that changing the value of
    ELASTICSEARCH_QUERY_MAX_TERM_BUCKET_SIZE
    does not change the actual limit on the returned bucket count, while changing
    searchFlags.maxAggValues
    (at least in version
    0.10.5
    ) actually does change the bucket count limit. I am a bit confused: what is the intended relation between the env variable and the query input parameter? This puzzles me especially considering that the query builder https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/c[…]ta/search/elasticsearch/query/request/SearchRequestHandler.java does not use
    finalSearchFlags
    when building aggregations, and does not seem to use the value coming from
    ELASTICSEARCH_QUERY_MAX_TERM_BUCKET_SIZE
    . What am I missing here?
  • m

    mysterious-advantage-78411

    09/05/2023, 1:19 PM
    Hi guys! Could somebody share the database resources (CPU, disk, and RAM) required for your DataHub instance, along with the total number of datasets in your DataHub as a reference metric? We regularly receive "Unable to emit metadata to Data Hub GMS: com.datahub.util.exception.RetryLimitReached: Failed to add after 4 retries", even though we have the latest version of the actions module, and we don't understand why it still happens. Today we tried to reduce these incidents by staggering the ingestions so they don't run at the same time. It still looks like a scalability issue: we plan to add more than 500 ingestions, roughly 10x growth from today, after which staggering ingestions by runtime will be difficult. Any ideas?
    👀 1
  • c

    chilly-potato-57465

    09/05/2023, 1:38 PM
    Hello! I am wondering how to implement the following. I would like to give users access equivalent to the Reader role to everything except datasets marked with a certain domain. For instance, I create a domain called Sensitive and annotate the relevant datasets with it. Then the Reader role should be able to see all datasets except those marked with Sensitive. As far as I understand, I have to implement a policy which explicitly gives access to everything else but those datasets. That is, I can't create a policy that excludes a certain domain; rather, I have to create a policy that includes everything except it. Is this understanding correct? Thank you!
  • b

    big-nightfall-99541

    09/05/2023, 1:55 PM
    Hi everyone!! I'm trying to emit a custom lineage between an MlModelGroup and an MlFeatureTable, but I'm facing some problems when executing the script:
    'Unable to emit metadata to DataHub GMS: java.lang.RuntimeException: Unknown aspect upstreamLineage for entity mlmodelgroup'
    What am I doing wrong? [The full script and traceback are in the thread.] Thank you!
  • g

    gentle-gold-63488

    09/05/2023, 3:53 PM
    Hi! How are you there? I have a question: how can I add a primary key and a foreign key to an S3 dataset?
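    One possible approach, sketched below, is to emit a schemaMetadata aspect that declares primaryKeys and foreignKeys for the dataset; the bucket, field names, and server address are hypothetical. Note that this writes the whole schemaMetadata aspect, so it is best built on top of the fields the S3 ingestion already produced.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ForeignKeyConstraintClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    dataset_urn = builder.make_dataset_urn("s3", "my-bucket/orders", "PROD")
    referenced_urn = builder.make_dataset_urn("s3", "my-bucket/customers", "PROD")

    schema = SchemaMetadataClass(
        schemaName="orders",
        platform=builder.make_data_platform_urn("s3"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="order_id",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="string",
            ),
            SchemaFieldClass(
                fieldPath="customer_id",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="string",
            ),
        ],
        # Declare the primary key and a foreign key pointing at another S3 dataset
        primaryKeys=["order_id"],
        foreignKeys=[
            ForeignKeyConstraintClass(
                name="customer_fk",
                foreignFields=[builder.make_schema_field_urn(referenced_urn, "customer_id")],
                sourceFields=[builder.make_schema_field_urn(dataset_urn, "customer_id")],
                foreignDataset=referenced_urn,
            )
        ],
    )

    DatahubRestEmitter("http://localhost:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema)
    )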
  • g

    gentle-gold-63488

    09/05/2023, 3:53 PM
    image.png
  • c

    colossal-football-58924

    09/05/2023, 5:51 PM
    Hello, I am using the quickstart guide to run DataHub in Docker. I get a series of error messages when I execute datahub docker quickstart. The first error is: "Unable to connect to GitHub, using default quickstart version mapping config". Can someone please assist with this error? Thank you in advance.
  • a

    able-library-93578

    09/05/2023, 7:28 PM
    Hello @witty-plumber-82249, I had a deployed, working version of DataHub v0.10.5.5. After tinkering with several config settings for SSO and GMS authentication, I decided to get it back to the normal state as deployed by my helm charts. I am now getting issues re-deploying the prerequisites, more specifically the mysql startup probe; I get the error below. I am guessing it has something to do with the PVC, but I'm not sure. I don't want to nuke everything, as I have things I do not want to lose. Any guidance will be appreciated.
  • b

    broad-grass-53166

    09/05/2023, 9:24 PM
    Hi, I am trying to run and debug metadata-ingestion locally but am unable to, due to the error below.
    Traceback (most recent call last):
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/entrypoints.py", line 10, in <module>
    from datahub.cli.check_cli import check
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/cli/check_cli.py", line 13, in <module>
    from datahub.ingestion.run.pipeline import Pipeline
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/ingestion/run/pipeline.py", line 29, in <module>
    from datahub.ingestion.extractor.extractor_registry import extractor_registry
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/ingestion/extractor/extractor_registry.py", line 1, in <module>
    from datahub.ingestion.api.registry import PluginRegistry
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/ingestion/api/registry.py", line 18, in <module>
    import entrypoints
    File "/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/entrypoints.py", line 10, in <module>
    from datahub.cli.check_cli import check
    ImportError: cannot import name 'check' from partially initialized module 'datahub.cli.check_cli' (most likely due to a circular import) (/home/asaniya/code/datahub-master/metadata-ingestion/src/datahub/cli/check_cli.py)
    I have already tried following the steps noted in the documentation below: https://datahubproject.io/docs/metadata-ingestion/developing/#requirements I am at the point where I can build the code but am unable to run it. Ideally, I would like to run this in IntelliJ. Could you please help resolve the above issue? CC: @hundreds-photographer-13496
  • c

    clever-dinner-20353

    09/06/2023, 4:02 AM
    Hello, I'm currently facing an issue where
    inlets
    are not showing up in DataHub. Here is the code:
    Copy code
    task1 = BashOperator(
            task_id="run_data_task",
            dag=dag,
            bash_command="echo 'This is where you might run your data tooling.'",
            inlets=[
                Dataset(platform="snowflake", name="mydb.schema.tableA"),
                Dataset(platform="snowflake", name="mydb.schema.tableB", env="DEV"),
                Dataset(
                    platform="snowflake",
                    name="mydb.schema.tableC",
                    platform_instance="cloud",
                ),
                # You can also put dataset URNs in the inlets/outlets lists.
                Urn(
                    "urn:li:dataset:(urn:li:dataPlatform:snowflake,mydb.schema.tableC,PROD)"
                ),
            ],
            outlets=[Dataset("snowflake", "mydb.schema.tableD")],
        )
    and here is the lineage. It should show all the previous Snowflake datasets.
    ✅ 1
  • a

    able-library-93578

    09/06/2023, 10:21 PM
    Hi All, I have followed the steps for the datahub-actions "hello_world" example. I have a DataHub deployment in AKS through helm charts, so all of the default naming is still intact, nothing custom. I have OIDC active, and
    METADATA_SERVICE_AUTH_ENABLED
    is active as well. Below is my YAML for the action:
    Copy code
    # hello_world.yaml
    name: "hello_world"
    source:
      type: "kafka"
      config:
        connection:
          bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-prerequisites-kafka:9092}
          schema_registry_url: ${SCHEMA_REGISTRY_URL:-<http://prerequisites-cp-schema-registry:8081>}
    filter:
      event_type: "EntityChangeEvent_v1"
      event:
        category: "TAG"
        operation: [ "ADD", "REMOVE" ]
        modifier: "urn:li:tag:SourcesSDP"
    action:
      type: "hello_world"
    datahub:
      server: "<https://my-datahub-domain.com/api/gms>"
      token: "my-token"
    Here is my logs from the cli:
    Copy code
    datahub actions -c hello_world.yaml                                           
    [2023-09-06 15:15:33,421] INFO     {datahub_actions.cli.actions:76} - DataHub Actions version: 0.0.13
    [2023-09-06 15:15:34,298] INFO     {datahub_actions.cli.actions:119} - Action Pipeline with name 'hello_world' is now running.
    %3|1694038534.460|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 179ms in state CONNECT)
    %3|1694038536.289|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 1 identical error(s) suppressed)
    %3|1694038567.357|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 16 identical error(s) suppressed)
    %3|1694038597.424|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038627.493|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038657.561|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038687.640|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038717.702|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038747.775|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 15 identical error(s) suppressed)
    %3|1694038778.840|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 16 identical error(s) suppressed)
    %3|1694038809.898|FAIL|rdkafka#consumer-1| [thrd:prerequisites-kafka:9092/bootstrap]: prerequisites-kafka:9092/bootstrap: Failed to resolve 'prerequisites-kafka:9092': nodename nor servname provided, or not known (after 3ms in state CONNECT, 16 identical error(s) suppressed)
    ^C[2023-09-06 15:20:31,393] INFO     {datahub_actions.cli.actions:137} - Stopping all running Action Pipelines...
    [2023-09-06 15:20:32,803] INFO     {datahub_actions.plugin.source.kafka.kafka_event_source:178} - Kafka consumer exiting main loop
    [2023-09-06 15:20:32,804] INFO     {datahub_actions.pipeline.pipeline_manager:81} - Actions Pipeline with name 'hello_world' has been stopped.
    
    Pipeline Report for hello_world
    
    Started at: 2023-09-06 15:15:34.297000 (Local Time)
    Duration: 298.508s
    
    Pipeline statistics
    
    {
        "started_at": 1694038534297
    }
    
    Action statistics
    
    {}
    Any advice on what to tweak is greatly appreciated.
  • b

    best-laptop-39921

    09/07/2023, 2:00 AM
    Hello, I've recently upgraded DataHub from version 0.10.2 to 0.10.5. However, I've encountered an issue when attempting advanced queries in the web UI, such as (
    \q fieldPaths: column_name
    ), as it doesn't work. Only
    \q name:
    works. Any advice would be greatly appreciated. Thank you. :) (I used the helm chart - are there any settings needed for advanced queries?)
  • q

    quiet-arm-91745

    09/07/2023, 8:09 AM
    Does the datahub helm chart expose an annotations block? I want to add this annotation to be able to use the GCS FUSE CSI driver:
    Copy code
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"
    Otherwise I can't mount a GCS bucket as a volume. Thanks in advance.
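    If the chart's podAnnotations value is what is needed here (it appears as a pod-level key in the values pasted earlier in this channel), a minimal sketch of an override could look like the following; which components actually need the annotation is an assumption.
    # values.yaml override (hypothetical placement - adjust per component)
    datahub-gms:
      podAnnotations:
        gke-gcsfuse/volumes: "true"
    datahub-frontend:
      podAnnotations:
        gke-gcsfuse/volumes: "true"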
  • b

    bitter-florist-92385

    09/07/2023, 8:31 AM
    Hey there, I am currently trying to install the Python SDK. I installed the datahub package via pip, but when I try to import modules like:
    Copy code
    from datahub import DataHubClient, MetadataChangeEvent
    I get an ImportError. Is the package not complete, or am I missing something else?
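    In the acryl-datahub package the commonly used entry points live in submodules rather than at the top level of the datahub module. A minimal sketch of imports and an emit call that should work (the server address and dataset name are placeholders):
    # pip install acryl-datahub
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    # Emit a simple aspect to verify the SDK is installed and can reach GMS
    emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder GMS address
    dataset_urn = builder.make_dataset_urn("hive", "example_db.example_table", "PROD")
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=DatasetPropertiesClass(description="created via the Python SDK"),
        )
    )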
  • m

    mysterious-advantage-78411

    09/07/2023, 9:34 AM
    Hi guys, is there a way to increase the S3 timeout in ingestion to avoid this error: botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: .....? Some buckets cannot be scanned due to this error.