plain-analyst-30186
02/05/2023, 2:26 PM
best-umbrella-88325
02/06/2023, 10:44 AM
strong-easter-55319
02/06/2023, 11:19 AM
./gradlew build
I get the following error: no such file or directory: ./gradlew
In fact, this file does not exist in the root of the project. Is there a missing step in the documentation that I should look into?
strong-easter-55319
02/06/2023, 12:34 PM
./gradlew build
FAILURE: Build failed with an exception.
* Where:
Build file '/Users/gamboad/Sites/datahub/metadata-service/restli-servlet-impl/build.gradle' line: 80
* What went wrong:
A problem occurred evaluating project ':metadata-service:restli-servlet-impl'.
> Could not resolve all dependencies for configuration ':metadata-service:restli-servlet-impl:dataModel'.
> Failed to calculate the value of task ':metadata-models:compileJava' property 'javaCompiler'.
> Unable to download toolchain matching these requirements: {languageVersion=8, vendor=any, implementation=vendor-specific}
> Unable to download toolchain. This might indicate that the combination (version, architecture, release/early access, ...) for the requested JDK is not available.
> Could not read '<https://api.adoptopenjdk.net/v3/binary/latest/8/ga/mac/aarch64/jdk/hotspot/normal/adoptopenjdk>' as it does not exist.
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
* Get more help at <https://help.gradle.org>
Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See <https://docs.gradle.org/6.9.2/userguide/command_line_interface.html#sec:command_line_warnings>
BUILD FAILED in 1m 58s
My java version is 11
$ java --version
openjdk 11.0.18 2023-01-17
OpenJDK Runtime Environment Homebrew (build 11.0.18+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.18+0, mixed mode)
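A hedged aside on the toolchain error above: the URL that could not be read is a JDK 8 build for mac/aarch64, which AdoptOpenJDK does not publish, so Gradle's toolchain auto-download fails. A minimal workaround sketch for gradle.properties, assuming a JDK 8 is already installed locally (the path below is a placeholder):
# gradle.properties (sketch): stop Gradle from trying to download a toolchain
# and point it at a locally installed JDK 8 instead.
org.gradle.java.installations.auto-download=false
org.gradle.java.installations.paths=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home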
The issue seems to be related to the Gradle version. Which version should I use?
powerful-cat-68806
02/06/2023, 3:04 PM
0 0 * * *
)
Does it make sense that if they run simultaneously, they'll fail?
Also - can I kill a running process?
sparse-memory-36759
02/06/2023, 3:56 PM
handsome-football-66174
02/06/2023, 5:26 PM
salmon-spring-51500
02/06/2023, 7:12 PM
jolly-gpu-90313
02/07/2023, 6:07 AM
datahub docker quickstart
or sudo datahub docker quickstart
> sudo datahub docker quickstart
Docker doesn't seem to be running. Did you start it?
> datahub docker quickstart
Docker doesn't seem to be running. Did you start it?
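A hedged sketch of a quick cross-check: datahub docker quickstart reaches Docker through the standard Docker environment, so the daemon must be reachable from the same shell (and DOCKER_HOST, if set, must point at it):
> docker info          # should print server details if the daemon is reachable
> echo $DOCKER_HOST    # if set, the CLI may be pointed at a different daemon/socket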
Can someone help me figure this out? My docker
setup is running.
powerful-cat-68806
02/07/2023, 9:03 AM
gray-ocean-32209
02/07/2023, 10:19 AM
average-dinner-25106
02/07/2023, 10:39 AM
billowy-flag-4217
02/07/2023, 1:37 PM
include_view_lineage
for the postgres ingestion library. I have noticed that it doesn't always emit correct lineage between my entities. It mostly seems to be view-to-view lineage, where one view has an upstream dependency on another, that gets ignored.
Is this by design, or should we expect include_view_lineage
to include lineage between views too?
incalculable-manchester-41314
02/07/2023, 2:47 PM
melodic-ability-49840
02/07/2023, 4:32 PM
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datahub_provider.entities import Dataset

default_args = {
    "owner": "airflow",
    "depends_on_past": False
}

with DAG(
    "datahub_lineage_backend_demo",
    default_args=default_args,
    description="An example DAG demonstrating the usage of DataHub's Airflow lineage backend.",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=["example_tag"],
    catchup=False,
) as dag:
    task1 = BashOperator(
        task_id="run_data_task",
        dag=dag,
        bash_command="echo 'This is where you might run your data tooling.'",
        inlets=[
            Dataset("snowflake", "mydb.schema.tableA"),
            Dataset("snowflake", "mydb.schema.tableB", "DEV")
        ],
        outlets=[Dataset("snowflake", "mydb.schema.tableD")],
    )
and also:
from datetime import timedelta

from airflow.operators.bash import BashOperator
from airflow import DAG
from airflow.utils.dates import days_ago

import datahub.emitter.mce_builder as builder
from datahub_provider.operators.datahub import DatahubEmitterOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=120),
}

with DAG(
    "datahub_lineage_emission_test",
    default_args=default_args,
    description="An example DAG demonstrating lineage emission within an Airflow DAG.",
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    catchup=False,
) as dag:
    # This example shows a BashOperator followed by a lineage emission. However, the
    # same DatahubEmitterOperator can be used to emit lineage in any context.
    lineage_dag_start = BashOperator(
        task_id="LINEAGE_START",
        dag=dag,
        bash_command="echo 'This is lineage test Start DAG'"
    )

    emit_lineage_task = DatahubEmitterOperator(
        task_id="emit_lineage",
        datahub_conn_id="datahub_rest_default",
        mces=[
            builder.make_lineage_mce(
                upstream_urns=[
                    builder.make_dataset_urn("s3", "mydb.schema.tableA"),
                    builder.make_dataset_urn("s3", "mydb.schema.tableB"),
                ],
                downstream_urn=builder.make_dataset_urn(
                    "s3", "mydb.schema.tableC"
                ),
            )
        ],
        dag=dag
    )

    lineage_finish = BashOperator(
        task_id="LINEAGE_FINISH",
        dag=dag,
        bash_command="echo 'This is lineage test Finish DAG'"
    )

    lineage_dag_start >> emit_lineage_task >> lineage_finish
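For reference, a minimal sketch of emitting the same lineage outside Airflow with the DataHub Python REST emitter, assuming the acryl-datahub package is installed and GMS is reachable at http://localhost:8080 (a placeholder):
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build the same tableA/tableB -> tableC lineage MCE as in the DAG above.
lineage_mce = builder.make_lineage_mce(
    upstream_urns=[
        builder.make_dataset_urn("s3", "mydb.schema.tableA"),
        builder.make_dataset_urn("s3", "mydb.schema.tableB"),
    ],
    downstream_urn=builder.make_dataset_urn("s3", "mydb.schema.tableC"),
)

# Placeholder GMS endpoint; adjust for your deployment.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit_mce(lineage_mce)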
rhythmic-quill-75064
02/07/2023, 4:33 PM
datahub-datahub-gms
:
[R2 Nio Event Loop-1-1] WARN c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
Caused by: java.net.ConnectException: Connection refused
[...]
[pool-16-thread-1] ERROR c.d.m.ingestion.IngestionScheduler:244 - Failed to retrieve ingestion sources! Skipping updating schedule cache until next refresh. start: 0, count: 30
com.linkedin.r2.RemoteInvocationException: com.linkedin.r2.RemoteInvocationException: Failed to get response from server for URI <http://localhost:8080/entities>
[...]
Caused by: com.linkedin.r2.RetriableRequestException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
gray-ghost-82678
02/07/2023, 5:25 PM
salmon-jordan-53958
02/07/2023, 5:30 PM
green-hamburger-3800
02/07/2023, 5:31 PM
bland-orange-13353
02/07/2023, 5:39 PM
salmon-jordan-53958
02/07/2023, 11:17 PM
quaint-barista-82836
02/08/2023, 3:46 AM
Calculating Metrics:   0%|          | 0/15 [00:00<?, ?it/s]
Calculating Metrics:  13%|█▎        | 2/15 [00:00<00:01, 12.15it/s]
Calculating Metrics:  27%|██▋       | 4/15 [00:02<00:08, 1.34it/s]
Calculating Metrics:  47%|████▋     | 7/15 [00:02<00:05, 1.34it/s]
Calculating Metrics:  80%|████████  | 12/15 [00:05<00:01, 2.14it/s]
Calculating Metrics: 100%|██████████| 15/15 [00:07<00:00, 1.93it/s]
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Finding datasets being validated
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Datasource my_bigquery_datasource is not present in platform_instance_map
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - GE expectation_suite_name - demo, expectation_type - expect_column_values_to_not_be_null, Assertion URN - urn:li:assertion:6f56acc887e38af0561eaeb8d41b0bdb
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - GE expectation_suite_name - demo, expectation_type - expect_column_values_to_be_between, Assertion URN - urn:li:assertion:aa04dc0fc98f145d01ae9fcd5f7f4ee3
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Sending metadata to datahub ...
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Dataset URN - urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.fbt_diff,PROD)
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Assertion URN - urn:li:assertion:6f56acc887e38af0561eaeb8d41b0bdb
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Assertion URN - urn:li:assertion:aa04dc0fc98f145d01ae9fcd5f7f4ee3
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Metadata sent to datahub.
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Validation succeeded!
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO -
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - Suite Name Status Expectations met
[2023-02-08, 01:07:57 EST] {subprocess.py:89} INFO - - demo ✔ Passed 2 of 2 (100.0 %)
[2023-02-08, 01:07:59 EST] {subprocess.py:93} INFO - Command exited with return code 0
I'm getting "Datasource my_bigquery_datasource is not present in platform_instance_map" even though I'm passing the value as:
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names: []
  - name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: <http://ip_address:8080>
      platform_instance_map:
        datasource_name: my_bigquery_datasource
      parse_table_names_from_sql: true
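One hedged observation: DataHubValidationAction's platform_instance_map appears to be keyed by the datasource name itself (datasource name -> platform instance), so the warning may come from using the literal key datasource_name above. A minimal sketch of the likely intended shape, with my_platform_instance as a hypothetical value:
platform_instance_map:
  my_bigquery_datasource: my_platform_instance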
magnificent-lock-58916
02/08/2023, 9:31 AM
shy-keyboard-55519
02/08/2023, 10:16 AM
datahub-gms
ends up crashlooping. I also tried increasing datahub-gms
versions by patch versions, and this behavior starts between v0.9.6.3 and v0.9.6.4.
I can provide any necessary logs upon request.
fierce-garage-74290
02/08/2023, 11:55 AM
url
to corresponding Confluence pages?
What would be your best bet here (or some good practices based on experience)? Thanks!
polite-honey-55441
02/08/2023, 12:26 PM
lively-spring-5482
02/08/2023, 3:04 PM
Failed to perform post authentication steps. Error message: Failed to provision user with urn
Frontend throws the following exception:
Caused by: com.linkedin.r2.message.rest.RestException: Received error 500 from server for URI <http://datahub-datahub-gms:8080/entities>
The gms application complains:
Caused by: java.sql.SQLException: Incorrect string value: '\xC5\x82aw F...' for column 'metadata' at row 1
Obviously, a UTF-8 "ł" character handling issue.
What can be done in this situation? And no…, I'm not that much into changing my name, Jaros*ł*aw is not that bad ;)
Thanks in advance for your suggestions!
quaint-barista-82836
02/08/2023, 5:48 PM
able-evening-90828
02/08/2023, 11:31 PM
query searchDataPlatformInstance {
  searchAcrossEntities(
    input: {types: [DATA_PLATFORM_INSTANCE], query: "", start: 0, count: 1000}
  ) {
    start
    count
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
able-evening-90828
02/09/2023, 1:17 AM
urn:li:tag: user identifier
, but I want to match either urn:li:tag:user identifier
or urn:li:tag:email address
.
query getSearchResultsForMultiple {
  searchAcrossEntities(input: {
    types: [DATASET],
    query: "",
    start: 0,
    count: 1000,
    orFilters: [
      {
        and: [
          {
            field: "fieldTags",
            values: ["urn:li:tag:user identifier", "urn:li:tag:email address"],
            condition: EQUAL
          }
        ]
      }
    ]
  }) {
    start
    count
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
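A minimal sketch of running this query programmatically, assuming a GMS reachable at http://localhost:8080 and a personal access token (both placeholders):
import requests

GRAPHQL_URL = "http://localhost:8080/api/graphql"  # placeholder GMS endpoint
TOKEN = "<personal-access-token>"  # hypothetical token generated in the DataHub UI

query = """
query getSearchResultsForMultiple {
  searchAcrossEntities(input: {
    types: [DATASET], query: "", start: 0, count: 1000,
    orFilters: [{and: [{
      field: "fieldTags",
      values: ["urn:li:tag:user identifier", "urn:li:tag:email address"],
      condition: EQUAL
    }]}]
  }) {
    total
    searchResults { entity { urn type } }
  }
}
"""

# POST the query as a JSON body with an Authorization bearer token.
response = requests.post(
    GRAPHQL_URL,
    json={"query": query},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
print(response.json())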