# troubleshoot
  • p

    proud-waitress-17589

    05/16/2023, 5:08 PM
    Reviving an old thread - is it possible to delete based on glossaryTermGroup? i.e. I would like to remove a large branch of my glossary that was populated via the Glossary ingestion, so I can rerun ingestion for that sub-tree, but I do not want to delete the whole Glossary.
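    A hedged sketch of one way to do this with the datahub CLI: soft-delete each term in the sub-tree, then the node itself, by URN. The URNs below are placeholders and flag spellings can differ between CLI versions, so check datahub delete --help first.
    # soft-delete one term from the branch (placeholder URN)
    datahub delete --urn "urn:li:glossaryTerm:<child-term-id>"
    # then soft-delete the containing term group / node (placeholder URN)
    datahub delete --urn "urn:li:glossaryNode:<sub-tree-node-id>"
    After the soft deletes, rerunning the Glossary ingestion for that sub-tree should repopulate it.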
  • r

    rich-state-73859

    05/17/2023, 4:32 PM
    Is there any update for this issue?
  • a

    astonishing-father-13229

    05/18/2023, 7:14 PM
    Can someone help me ?
  • a

    adamant-furniture-37835

    05/23/2023, 7:56 AM
    Hi @astonishing-answer-96712, apologies for the delayed response. I didn't notice any error message in the dev tools on the UI. Maybe I haven't understood the feature or my expectations are different. Here is the scenario and my questions:
    1. I created a View with the filter "platform of type Vertica or Tableau" and made it my default view.
        a. When I log in to the homepage, it shows me everything, i.e. all entity types and all platforms. I can see in dev tools that a GraphQL call is made to fetch the View details, but the results aren't filtered. Shouldn't the landing page only show what the default view allows?
    2. On the home page, if I click on any other platform type, say Snowflake, it shows the message: No results found for "". This is good, but the unwanted platforms shouldn't be shown in the first place, right?
    3. Under "Explore your data", I am able to navigate to all entity types and look at their details even though the View is selected in the top panel. Our expectation is that nothing should be shown that falls outside the view definition.
    Please share your opinion on whether this is a bug or part of the feature itself. Thanks, Mahesh
  • f

    future-analyst-98466

    05/31/2023, 6:42 AM
    @few-air-34037 how do I pin/lock sqlparse to version 0.4.3? Thanks!
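    For reference, a minimal sketch of pinning that version at install time; the acryl-datahub extra shown is only an example, swap in whatever sources you actually use:
    pip install "sqlparse==0.4.3"
    # or pin it alongside the DataHub CLI install, e.g.
    pip install "acryl-datahub[snowflake]" "sqlparse==0.4.3"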
  • h

    helpful-dream-67192

    06/02/2023, 8:27 AM
    We are trying to deploy the latest version of DataHub via helm and are getting the same error in the datahub-gms pod:
    Copy code
    2023-06-02 08:21:55,927 [ThreadPoolTaskExecutor-1] WARN  c.l.m.b.k.DataHubUpgradeKafkaListener:99 - System version is not up to date: v0.10.3-0. Waiting for datahub-upgrade to complete...
    2023-06-02 08:21:56,093 [pool-20-thread-1] WARN  org.elasticsearch.client.RestClient:65 - request [POST <http://elasticsearch-master:9200/datahub_usage_event/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true>] returned 2 warnings: [299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See <https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html> to enable security."],[299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
    2023-06-02 08:21:56,112 [pool-20-thread-1] WARN  org.elasticsearch.client.RestClient:65 - request [POST <http://elasticsearch-master:9200/datahub_usage_event/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true>] returned 2 warnings: [299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See <https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html> to enable security."],[299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
    2023-06-02 08:21:56,117 [pool-20-thread-1] WARN  org.elasticsearch.client.RestClient:65 - request [POST <http://elasticsearch-master:9200/datahub_usage_event/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true>] returned 2 warnings: [299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See <https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html> to enable security."],[299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
    2023-06-02 08:21:56,394 [I/O dispatcher 1] WARN  org.elasticsearch.client.RestClient:65 - request [POST <http://elasticsearch-master:9200/_bulk?timeout=1m>] returned 1 warnings: [299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See <https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html> to enable security."]
    2023-06-02 08:21:56,402 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 1 Took time ms: -1
    2023-06-02 08:22:02,937 [pool-12-thread-1] WARN  org.elasticsearch.client.RestClient:65 - request [POST <http://elasticsearch-master:9200/datahubpolicyindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true>] returned 2 warnings: [299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See <https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html> to enable security."],[299 Elasticsearch-7.17.3-5ad023604c8d7416c9eb6c0eadb62b14e766caff "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
    2023-06-02 08:22:38,508 [R2 Nio Event Loop-1-1] WARN  c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
    io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
    Caused by: java.net.ConnectException: Connection refused
    	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
    	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
    	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
    	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
    	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
    	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    	at java.base/java.lang.Thread.run(Thread.java:829)
    2023-06-02 08:22:40,615 [R2 Nio Event Loop-1-2] WARN  c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
    io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
    Can someone help here? Thanks in advance. cc: @proud-dusk-671 @millions-football-58938 @brainy-beach-58125
    plus1 1
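    The first WARN line says GMS is waiting for the datahub-upgrade (system update) job to complete, so a reasonable first check is that job's status and logs. A hedged sketch, assuming kubectl is pointed at the release's namespace; the job name below is an assumption based on common chart defaults, substitute whatever the first command actually shows:
    kubectl get jobs | grep datahub
    # job name is an assumption; use the name listed by the command above
    kubectl logs job/datahub-datahub-system-update-job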
  • f

    fast-vegetable-81275

    06/02/2023, 2:45 PM
    I tried this approach following the docs, but it didn't work. Does this S3 method require a Spark and Hadoop setup on my machine?
  • c

    cuddly-butcher-39945

    06/06/2023, 3:57 PM
    I am also having issues with a GMS deployment not finishing with helm. I am deploying helm chart release 2.161 onto EKS on Fargate, with AWS RDS/OpenSearch services and standalone Kafka on Fargate as well.
    gms-deploy-logs.zip
  • b

    bland-gigabyte-28270

    06/10/2023, 5:42 AM
    Same issue, can someone help?
  • b

    bland-gigabyte-28270

    06/12/2023, 1:00 AM
    We are encountering the same problem. Can someone help?
  • e

    elegant-article-21703

    06/13/2023, 8:28 AM
    Hi again, I've been playing with some combinations and I realised that:
    • If I don't apply a role to the new user I'm creating, I get an error on the login page when logging in/out.
    • When I apply a role to a new user, if that user belongs to a group, the privilege restrictions I applied are overridden by the role's privileges (Reader in this case).
    • If I remove the role from that user, it cannot access any of the assets (regardless of the policy applied to the user's group).
    Has anyone faced something similar? Thanks everyone in advance (more info on my context in the thread)
  • e

    elegant-salesmen-99143

    06/13/2023, 12:57 PM
    Hi Team, sorry for repeating, but it's been a few weeks since I started trying to get help with the Analytics tab issue, where the getHighlights and getAnalyticsChart queries return empty from the backend and Analytics doesn't work. I don't know what we can do in this situation and really need help from the Team, please🙏🙏🙏 The logs have been provided as requested, but no answer so far; I don't know if anyone saw them.
  • t

    thankful-morning-85093

    06/14/2023, 10:56 PM
    Hi All, I tried to upgrade our DataHub deployment from 0.8.45 to 0.10.4. I am still getting "Unauthorized" while the GMS pod does not throw any error.
  • e

    elegant-guitar-28442

    06/15/2023, 6:05 AM
    Thank you very much! I solved the problem following your hints. I will contribute a PR to fix this bug.
    thank you 1
  • a

    adorable-lawyer-88494

    06/16/2023, 7:51 AM
    FYI @best-umbrella-88325
  • i

    incalculable-portugal-45517

    06/19/2023, 5:04 PM
    bump 🙂
  • b

    bland-gigabyte-28270

    06/22/2023, 1:17 AM
    Resurfacing it here: I'm still having problems even with the max_threads fix. This is 0.10.3 using Snowflake. Config:
    Copy code
    source:
        type: snowflake
        config:
            account_id: <account-id>
            include_table_lineage: true
            include_view_lineage: true
            include_tables: true
            include_views: true
            profiling:
                enabled: true
                profile_table_level_only: true
            stateful_ingestion:
                enabled: true
            warehouse: DATAHUB_WH
            username: datahub_user
            role: DATAHUB_READER
            database_pattern:
                allow:
                    - PATTERN
            password: '${SNOWFLAKE_DATAHUB_USER_PASSWORD}'
    sink:
        type: datahub-rest
        config:
            server: '<http://datahub-datahub-gms:8080/>'
            max_threads: 1
    Logs:
    Copy code
    {
              "error": "Unable to emit metadata to DataHub GMS: javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?",
              "info": {
                "exceptionClass": "com.linkedin.restli.server.RestLiServiceException",
                "message": "javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?",
                "status": 500,
                "id": "urn:li:dataset:(urn:li:dataPlatform:snowflake,arene.aha.kfk_aha_feature,PROD)"
              }
            },
            {
              "error": "Unable to emit metadata to DataHub GMS: javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?",
              "info": {
                "exceptionClass": "com.linkedin.restli.server.RestLiServiceException",
                "message": "javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?",
                "status": 500,
                "id": "urn:li:dataset:(urn:li:dataPlatform:snowflake,arene.aha.kfk_aha_release,PROD)"
              }
            },
    plus1 2
  • g

    great-car-44033

    07/03/2023, 12:02 PM
    I too have the same issue that was reported by @salmon-exabyte-77928. Is there any plan to fix this in upcoming releases?
  • p

    proud-intern-59151

    07/11/2023, 6:31 AM
    Hi @hundreds-photographer-13496, thank you for your reply. I am just curious whether it is necessary to ingest the Athena dataset (in my case) into DataHub, given that I am only submitting Great Expectations' validation results into DataHub. Do I really need to ingest my entire data into DataHub first? I have followed the document linked below, and it doesn't mention the need to pre-populate the entire datasets into DataHub before submitting the respective metadata. https://datahubproject.io/docs/metadata-ingestion/integration_docs/great-expectations/ In my case, the logs say that my data source name (my_datasource) is not present in "platform_instance_map", which I don't quite understand.
    Datasource my_datasource is not present in platform_instance_map.
    🩺 1
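    For reference, a hedged sketch of the DataHubValidationAction block in a Great Expectations checkpoint, showing where platform_instance_map maps the GE datasource name to a DataHub platform instance; the server URL and instance name are placeholders:
    action:
        module_name: datahub.integrations.great_expectations.action
        class_name: DataHubValidationAction
        server_url: http://localhost:8080  # your GMS endpoint
        platform_instance_map:
            my_datasource: my_athena_instance  # placeholder platform instance name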
  • r

    rich-restaurant-61261

    07/11/2023, 6:44 AM
    Hi Team, I know this used to be blocked by awslabs/python-deequ#106, but I saw that deequ just got a new release, which should unblock this issue? I receive the following error when trying to ingest data from S3, and I am assuming we need a SPARK_VERSION environment variable to solve it? Supported values are: dict_keys(['3.3', '3.2', '3.1', '3.0', '2.4']) @gray-shoe-75895 @big-carpet-38439
    Copy code
    [2023-07-11 06:32:40,593] ERROR    {datahub.entrypoints:199} - Command failed: Failed to find a registered source for type s3: SPARK_VERSION environment variable is required. Supported values are: dict_keys(['3.3', '3.2', '3.1', '3.0', '2.4'])
    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-s3-0.10.4/lib/python3.10/site-packages/pydeequ/configs.py", line 26, in _get_spark_version
        spark_version = os.environ["SPARK_VERSION"]
      File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
        raise KeyError(key) from None
    KeyError: 'SPARK_VERSION'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-s3-0.10.4/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 120, in _add_init_error_context
        yield
      File "/tmp/datahub/ingest/venv-s3-0.10.4/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 220, in __init__
        source_class = source_registry.get(source_type)
      File "/tmp/datahub/ingest/venv-s3-0.10.4/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 183, in get
        tp = self._ensure_not_lazy(key)
      File "/tmp/datahub/ingest/venv-s3-0.10.4/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 127, in _ensure_not_lazy
        plugin_class = import_path(path)
      File "/tmp/datahub/ingest/venv-s3-0.10.4/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 57, in import_path
        item = importlib.import_module(module_name)
      File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
      File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
      File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 883, in exec_module
      File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
      File "/tmp/datahub/ingest/venv-s3-0.10.4/lib/python3.10/site-packages/datahub/ingestion/source/s3/__init__.py", line 1, in <module>
        from datahub.ingestion.source.s3.source import S3Source
      File "/tmp/datahub/ingest/venv-s3-0.10.4/lib/python3.10/site-packages/datahub/ingestion/source/s3/source.py", line 12, in <module>
        import pydeequ
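    Based on that traceback, pydeequ reads a SPARK_VERSION environment variable before the S3 source can even load. A minimal sketch, assuming CLI-based ingestion and a placeholder recipe file name:
    export SPARK_VERSION=3.3          # pick the value matching your Spark installation
    datahub ingest -c s3_recipe.yml   # placeholder recipe file name
    For UI-based ingestion the variable would have to be set in the environment of the executor/actions container instead.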
  • v

    victorious-monkey-86128

    07/11/2023, 4:47 PM
    Hi, here is also some more info from the build process:
    > Task :docker:kafka-setup:docker
    #12 ERROR: process "/bin/sh -c mkdir -p /opt   && mirror=$(curl --stderr /dev/null <https://www.apache.org/dyn/closer.cgi>\\?as_json\\=1 | jq -r '.preferred')   && curl -sSL \"${mirror}kafka/${KAFKA_VERSION}/kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz\"   | tar -xzf - -C /opt   && mv /opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION} /opt/kafka   && adduser -DH -s /sbin/nologin kafka   && chown -R kafka: /opt/kafka   && echo \"===> Installing python packages ...\"    && pip install --no-cache-dir jinja2 requests   && pip install --prefer-binary --prefix=/usr/local --upgrade \"${PYTHON_CONFLUENT_DOCKER_UTILS_INSTALL_SPEC}\"   && rm -rf /tmp/*   && apk del --purge .build-deps" did not complete successfully: exit code: 1
    ------
    > [stage-1  5/15] RUN mkdir -p /opt   && mirror=$(curl --stderr /dev/null <https://www.apache.org/dyn/closer.cgi?as_json=1> | jq -r '.preferred')   && curl -sSL "${mirror}kafka/3.4.0/kafka_2.13-3.4.0.tgz"   | tar -xzf - -C /opt   && mv /opt/kafka_2.13-3.4.0 /opt/kafka   && adduser -DH -s /sbin/nologin kafka   && chown -R kafka: /opt/kafka   && echo "===> Installing python packages ..."    && pip install --no-cache-dir jinja2 requests   && pip install --prefer-binary --prefix=/usr/local --upgrade "git+<https://github.com/confluentinc/confluent-docker-utils@v0.0.58>"   && rm -rf /tmp/*   && apk del --purge .build-deps:
    #12 1.144 tar: invalid magic
    #12 1.144 tar: short read
    ------                                                                                                                                                                                                                                                                       Dockerfile:31
    --------------------
    30 |     RUN apk add --no-cache -t .build-deps git curl ca-certificates jq gcc musl-dev libffi-dev zip
    31 | >>> RUN mkdir -p /opt \
    32 | >>>   && mirror=$(curl --stderr /dev/null <https://www.apache.org/dyn/closer.cgi>\?as_json\=1 | jq -r '.preferred') \
    33 | >>>   && curl -sSL "${mirror}kafka/${KAFKA_VERSION}/kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz" \
    34 | >>>   | tar -xzf - -C /opt \
    35 | >>>   && mv /opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION} /opt/kafka \
    36 | >>>   && adduser -DH -s /sbin/nologin kafka \
    37 | >>>   && chown -R kafka: /opt/kafka \
    38 | >>>   && echo "===> Installing python packages ..."  \
    39 | >>>   && pip install --no-cache-dir jinja2 requests \
    40 | >>>   && pip install --prefer-binary --prefix=/usr/local --upgrade "${PYTHON_CONFLUENT_DOCKER_UTILS_INSTALL_SPEC}" \
    41 | >>>   && rm -rf /tmp/* \
    42 | >>>   && apk del --purge .build-deps
    43 |
    --------------------
    ERROR: failed to solve: process "/bin/sh -c mkdir -p /opt   && mirror=$(curl --stderr /dev/null <https://www.apache.org/dyn/closer.cgi>\\?as_json\\=1 | jq -r '.preferred')   && curl -sSL \"${mirror}kafka/${KAFKA_VERSION}/kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz\"   | tar -xzf - -C /opt   && mv /opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION} /opt/kafka   && adduser -DH -s /sbin/nologin kafka   && chown -R kafka: /opt/kafka   && echo \"===> Installing python packages ...\"    && pip install --no-cache-dir jinja2 requests   && pip install --prefer-binary --prefix=/usr/local --upgrade \"${PYTHON_CONFLUENT_DOCKER_UTILS_INSTALL_SPEC}\"   && rm -rf /tmp/*   && apk del --purge .build-deps" did not complete successfully: exit code: 1
    > Task :docker:kafka-setup:docker FAILED
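    The "tar: invalid magic / short read" lines suggest whatever was piped into tar was not a valid .tgz (for example, a mirror error page). A hedged sketch for re-running the same download step outside the build, using the versions shown above:
    mirror=$(curl --stderr /dev/null "https://www.apache.org/dyn/closer.cgi?as_json=1" | jq -r '.preferred')
    echo "$mirror"
    curl -sSL "${mirror}kafka/3.4.0/kafka_2.13-3.4.0.tgz" -o /tmp/kafka.tgz
    file /tmp/kafka.tgz   # should report gzip compressed data, not HTML or ASCII text
    If the preferred mirror no longer carries that Kafka release, https://archive.apache.org/dist/kafka/ is the usual fallback location.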
  • b

    bitter-wire-42401

    07/12/2023, 12:00 PM
    I was able to resolve the issue by following the steps here: https://datahubspace.slack.com/archives/C029A3M079U/p1680557916230119?thread_ts=1680516752.955739&cid=C029A3M079U
    But now datahub docker ingest-sample-data does not work:
    ERROR {datahub.ingestion.run.pipeline:68} - failed to write record with workunit file
  • s

    some-crowd-4662

    07/14/2023, 7:10 PM
    Ingest Log
  • s

    some-crowd-4662

    07/17/2023, 3:18 AM
    @hundreds-photographer-13496 Hi, I turned on debug mode and then I saw the following error
  • b

    brave-engine-32813

    07/19/2023, 4:47 AM
    Hi everyone, is anyone facing issues connecting to SSL-enabled S3 or MinIO in DataHub UI ingestion? If you are connecting using the S3 delta lake source config, is the verify_ssl parameter working as expected? Thanks
  • n

    nutritious-bird-77396

    07/19/2023, 3:39 PM
    @delightful-ram-75848 Let me rephrase the question: I am able to ingest Redshift tables, schemas, and views, but for views the schema is not pulled. Is that currently supported in DataHub?
  • s

    some-crowd-4662

    07/19/2023, 6:52 PM
    Yes, I can hit this URL in the browser.
  • b

    bland-barista-59197

    07/25/2023, 7:17 PM
    Hi @delightful-ram-75848, is it possible to run this query: /q browsePaths: /datasets/prod/hive* ? I'm getting an error: 500 Server_error.
  • e

    eager-nest-72774

    08/02/2023, 4:39 PM
    @hundreds-photographer-13496 On the Kubernetes cluster I generated credentials using boto3 and passed them like this:
    s3_resource = boto3.resource('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key, aws_session_token=token)
    The credentials work in boto3, but when I pass the same credentials in the delta lake ingestion recipe, it does not work from the pod on the Kubernetes cluster.
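    For comparison, a hedged sketch of how those credentials are typically laid out in a delta-lake recipe; the base_path and region are placeholders, and field names can differ between DataHub versions, so check the delta-lake source docs:
    source:
        type: delta-lake
        config:
            base_path: "s3://my-bucket/path/to/delta-table/"   # placeholder
            s3:
                aws_config:
                    aws_access_key_id: "${AWS_ACCESS_KEY_ID}"
                    aws_secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
                    aws_session_token: "${AWS_SESSION_TOKEN}"
                    aws_region: "us-east-1"                    # placeholder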
  • b

    bland-barista-59197

    08/03/2023, 4:10 PM
    Hi @delightful-ram-75848 any update?