# all-things-deployment
  • fierce-finland-15121 (04/18/2023, 6:41 PM)
    Hello. I am attempting to deploy DataHub using the Helm chart, but the deployment is failing because the update job fails. When I look at the stack trace, all I see is a fairly vague trace that looks mostly like Spring Boot internals. The top-level error is:
    ERROR SpringApplication Application run failed
     org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'upgradeCli': Unsatisfied dependency expressed through field 'noCodeUpgrade'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ebeanServer' defined in class path resource [com/linkedin/gms/factory/entity/EbeanServerFactory.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [io.ebean.EbeanServer]: Factory method 'createServer' threw exception; nested exception is java.lang.NullPointerException
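    Note for readers hitting the same NullPointerException from EbeanServerFactory: it typically means the upgrade job started without complete SQL datasource settings, so checking those chart values is a good first step. A minimal sketch of the relevant override, assuming the chart's global.sql.datasource layout and a pre-created mysql-secrets secret; verify the key names against your chart version's values.yaml:

        # Key names follow the upstream chart's values.yaml and may differ across versions.
        cat <<'EOF' > sql-values.yaml
        global:
          sql:
            datasource:
              host: "mysql:3306"              # placeholder MySQL endpoint
              url: "jdbc:mysql://mysql:3306/datahub?verifyServerCertificate=false&useSSL=true"
              driver: "com.mysql.cj.jdbc.Driver"
              username: "datahub"
              password:
                secretRef: mysql-secrets      # k8s secret holding the DB password
                secretKey: mysql-root-password
        EOF
        helm upgrade --install datahub datahub/datahub --values sql-values.yaml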
  • bland-orange-13353 (04/19/2023, 5:29 AM)
    This message was deleted.
  • careful-lunch-53644 (04/19/2023, 5:34 AM)
    Hi team, MySQL, Kafka, and Zookeeper already exist in the cluster and I want to reuse these services. How can I configure the Docker Compose YAML file? Can you give an example?
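    One way to do this when running the Compose files directly (rather than through the quickstart CLI) is an override file that points the DataHub containers at the existing services. A rough sketch: the service name and the EBEAN_DATASOURCE_*/KAFKA_BOOTSTRAP_SERVER variable names follow the quickstart compose files, so check them against the copy you deploy, and the my-mysql/my-kafka endpoints are placeholders:

        cat <<'EOF' > docker-compose.override.yml
        services:
          datahub-gms:
            environment:
              - EBEAN_DATASOURCE_HOST=my-mysql:3306
              - EBEAN_DATASOURCE_URL=jdbc:mysql://my-mysql:3306/datahub?verifyServerCertificate=false&useSSL=true
              - EBEAN_DATASOURCE_USERNAME=datahub
              - EBEAN_DATASOURCE_PASSWORD=datahub
              - KAFKA_BOOTSTRAP_SERVER=my-kafka:9092
        EOF
        # Remove the bundled mysql/kafka/zookeeper services from the base file, then:
        docker compose -f docker-compose.yml -f docker-compose.override.yml up -d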
  • flat-painter-78331 (04/19/2023, 10:00 AM)
    Hi team, I'm using DataHub deployed on Kubernetes and I'm trying to integrate it with Airflow. I have added the connections and installed the plugin on the Airflow worker, webserver, and scheduler pods, but I cannot see the plugin on the Airflow home page (Admin > Plugins). Can someone help me figure this out, please? In the image attached below, https://scx-datahub.cxos.tech is where I've exposed the DataHub application.
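    For anyone debugging the same symptom: the airflow plugins CLI shows what Airflow actually loaded, which is more reliable than the UI page. A quick check to run in each pod, assuming the plugin package is acryl-datahub-airflow-plugin:

        # Repeat for the worker, webserver, and scheduler pods (names are placeholders).
        kubectl exec -it <airflow-webserver-pod> -- pip show acryl-datahub-airflow-plugin
        kubectl exec -it <airflow-webserver-pod> -- airflow plugins   # lists plugins Airflow loaded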
  • best-daybreak-64419 (04/19/2023, 6:59 PM)
    Hi Team! I need help! 🙇‍♂️ Before I ask my questions, I want to let you know my current situation, because the significant time difference may cause delays in our communication. I am trying to deploy DataHub on AWS following the deployment guide. I have launched three EKS nodes (v1.24.9-eks) using Terraform. When I ran `helm install prerequisites`, the 'prerequisites-cp-schema-registry' pod kept failing and restarting, while the other pods remained in the pending state. It was exactly the same issue mentioned in this thread on Slack. Although I added the EBS usage policy to EKS, PVC binding did not work, and when I ran `kubectl get pv`, no PVs were found. Then I checked `kubectl get storageclasses` and found a StorageClass named 'standard'. I finally succeeded in binding the PVC only after modifying the values.yaml file as follows, and I could see that `prerequisites-cp-schema-registry` was running normally.
    elasticsearch:
      ...
      # Request smaller persistent volumes.
      volumeClaimTemplate:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "standard"
        resources:
          requests:
            storage: 30Gi

    ...

    mysql:
      enabled: true
      auth:
        # For better security, add mysql-secrets k8s secret with mysql-root-password, mysql-replication-password and mysql-password
        existingSecret: mysql-secrets
      global:
        storageClass: "standard"
    I have deployed DataHub once before, on EKS v1.23.13-eks with DataHub version 9.x, using the same method. At that time, the PVCs were bound to the gp2 (default) storage class: once I added the EBS policy to EKS, the PVCs were bound immediately, without modifying the values.yaml file.

    So, my first question is: with the EKS version upgrade, I no longer see gp2 as the default storage class; there is only 'standard' (which is not marked as default). I worked around this by adding the storageClass option in the values.yaml file, but I'm wondering whether creating a separate default storage class with volumeBindingMode set to WaitForFirstConsumer is the correct solution.

    `Second question`: the Kubernetes deployment document recommends installing 'prerequisites' and then installing 'datahub/datahub' using helm, citing dependencies. However, when RDS (mysql), MSK, and OpenSearch are already set up, should I set 'enabled' to true or false for es and mysql in the prerequisites values.yaml file?

    `Third question`: do EKS nodes need to be at least 3 in number? Also, is it necessary to have 3 or more Kafka brokers?

    Thank you for taking the time to read through my lengthy question. If there is any part of my inquiry that you didn't fully understand, please feel free to ask for clarification. I look forward to your response.
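    On the first question: creating a default StorageClass with volumeBindingMode WaitForFirstConsumer is a common fix on newer EKS versions, where gp2 may no longer be marked default and the EBS CSI driver add-on has to be installed for dynamic provisioning. A sketch of such a class (gp3 through the EBS CSI provisioner); treat the parameters as a starting point:

        cat <<'EOF' | kubectl apply -f -
        apiVersion: storage.k8s.io/v1
        kind: StorageClass
        metadata:
          name: gp3
          annotations:
            storageclass.kubernetes.io/is-default-class: "true"  # mark as cluster default
        provisioner: ebs.csi.aws.com   # requires the aws-ebs-csi-driver add-on
        volumeBindingMode: WaitForFirstConsumer
        parameters:
          type: gp3
        EOF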
  • bland-orange-13353 (04/20/2023, 12:02 AM)
    This message was deleted.
  • rich-crowd-33361 (04/20/2023, 12:18 AM)
        ..........................................................
        Unable to run quickstart - the following issues were detected:
        - quickstart.sh or dev.sh is not running
        If you think something went wrong, please file an issue at https://github.com/datahub-project/datahub/issues or send a message in our Slack https://slack.datahubproject.io/
        Be sure to attach the logs from C:\Users\EDS_DA~1\AppData\Local\Temp\tmp6o64139g.log
        PS C:\Windows\system32>
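    That particular quickstart message usually means the containers never came up at all, often because the Docker daemon is not reachable. Two generic checks before re-running, nothing DataHub-specific assumed:

        docker info     # verify the Docker daemon is running and reachable
        docker ps -a    # see whether any datahub containers started and then exited
        datahub docker quickstart   # re-run once Docker looks healthy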
  • bland-orange-13353 (04/20/2023, 6:25 AM)
    This message was deleted.
  • microscopic-machine-90437 (04/20/2023, 9:41 AM)
    Hello everyone, I'm trying to ingest Snowflake metadata (my DataHub setup is in a Kubernetes cluster), and while doing so I'm getting this error:
     ERROR: The ingestion process was killed, likely because it ran out of memory. You can resolve this issue by allocating more memory to the datahub-actions container.
    When I go through the values.yml file, I can see that the datahub-actions container has 512Mi of memory. My questions are: when we ingest metadata, in which container is it stored? If the data we are trying to ingest from Snowflake is in the GBs, how far do we have to scale the memory of the actions container? And is there a way to find out the size of the data/metadata we are trying to ingest (from Snowflake or any other source)? Can someone help me with this?
    exec-urn_li_dataHubExecutionRequest_84aa3804-1cc9-4478-81a5-fe1c91d93f4b.log
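    Two notes for readers: the ingested metadata itself is persisted in the storage layer (MySQL/Elasticsearch), not in the actions container, which only needs enough memory to run the extraction process. To raise its limit through the chart, the resources block of the acryl-datahub-actions component is the knob; a sketch, with the component key assumed to match your chart's values.yaml and the sizes as placeholders:

        cat <<'EOF' > actions-resources.yaml
        acryl-datahub-actions:
          resources:
            limits:
              memory: 2Gi      # raised from the 512Mi default; size to your largest ingestion
            requests:
              cpu: 300m
              memory: 1Gi
        EOF
        helm upgrade datahub datahub/datahub --values actions-resources.yaml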
  • bland-gold-64386 (04/20/2023, 12:32 PM)
    Hi team, I'm getting requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: when connecting Airflow with DataHub. I am using the command below:
        airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://domain.com' --conn-password ''
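    A 401 with an empty --conn-password is expected when metadata service authentication is enabled on the DataHub side: the connection password should carry a personal access token, generated in the UI under Settings > Access Tokens. The same command with a placeholder token:

        airflow connections add 'datahub_rest_default' \
          --conn-type 'datahub_rest' \
          --conn-host 'http://domain.com' \
          --conn-password '<your-datahub-access-token>'   # placeholder PAT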
  • rapid-hamburger-95729 (04/20/2023, 1:47 PM)
    Hello! Are there any version requirements or restrictions for OpenSearch? We're going to upgrade our Elasticsearch domain to OpenSearch and want to know if there are any recommended versions or known compatibility issues.
  • steep-doctor-17127 (04/20/2023, 10:55 PM)
    Hello! I have gone through https://datahubproject.io/docs/deploy/kubernetes with no issues. However, I now need to follow https://datahubproject.io/docs/deploy/aws to use AWS managed services for the storage layer, and I am getting errors with the Elasticsearch host connection. It is too long to explain everything I have done, but if anyone could help, I would appreciate it. Thanks in advance.
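    For reference, pointing the chart at an AWS OpenSearch/Elasticsearch domain mostly comes down to host, port, and TLS values. A sketch following the shape of the AWS deployment guide, with a placeholder domain endpoint; verify the field names against your chart version:

        cat <<'EOF' > es-values.yaml
        global:
          elasticsearch:
            host: "vpc-<your-domain>.<region>.es.amazonaws.com"   # placeholder endpoint
            port: "443"
            useSSL: "true"
        EOF
        helm upgrade datahub datahub/datahub --values es-values.yaml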
  • powerful-cat-68806 (04/21/2023, 11:13 AM)
    Hi, I need some help upgrading to the latest version.
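    The usual Helm flow for an upgrade, with the standard repo/release names from the deployment guide assumed:

        helm repo update
        helm search repo datahub/datahub     # check the latest available chart version
        helm upgrade datahub datahub/datahub --values values.yaml   # re-apply your overrides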
  • most-animal-32096 (04/21/2023, 12:28 PM)
    FYI, regarding deployment through Docker (Compose), especially the "quickstart" variants: here is PR #7880, aiming at improving startup thanks to more robust dependencies between services/containers.
  • microscopic-machine-90437 (04/21/2023, 1:58 PM)
    Hello everyone, I have a Tableau ingestion scheduled four times a day, and all runs executed successfully yesterday. However, the UI shows 'failed' as the last status, and the last execution time is also wrong. Can someone help me with this?
  • limited-forest-73733 (04/24/2023, 1:51 PM)
    Hey team, I am using DataHub image version 0.10.1 from acryldata and helm chart version 0.2.162, but datahub-gms is failing (the pod is not coming up):
     ERROR: Liveness probe failed: HTTP probe failed with statuscode: 503
     Readiness probe failed: HTTP probe failed with statuscode: 503
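    A 503 on both probes usually means gms is still working through (or stuck in) its bootstrap steps. Generic first checks, with the deployment and label names assumed to match the standard chart:

        kubectl get pods                                    # has datahub-system-update-job completed?
        kubectl logs deploy/datahub-datahub-gms --tail=200  # which bootstrap step is it stuck on?
        kubectl describe pod <datahub-gms-pod>              # probe failures and events (pod name is a placeholder)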
  • flat-painter-78331 (04/25/2023, 11:01 AM)
    Hi team, good day! I'm trying to deploy DataHub on Kubernetes, but I'm getting an error in the `datahub-system-update-job` saying `Error: secret "datahub-auth-secrets" not found`. I had deployed DataHub previously and it was working fine. I deleted the deployment and am trying to re-deploy now. Can someone tell me if there's anything I need to do or something I'm not looking at, please? Thanks in advance!
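    When metadata service authentication is enabled, the system-update job expects that secret to exist; a workaround sometimes used after deleting a release is to recreate it by hand. A sketch only: the key names below are guesses, so inspect the chart's templates (or a backup of the old secret) for the exact keys before relying on this:

        # Key names are assumptions; verify against the chart's templates before use.
        kubectl create secret generic datahub-auth-secrets \
          --from-literal=token_service_signing_key="$(openssl rand -base64 32)" \
          --from-literal=system_client_secret="$(openssl rand -base64 32)"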
  • bland-orange-13353 (04/25/2023, 6:58 PM)
    This message was deleted.
  • bland-orange-13353 (04/26/2023, 10:15 AM)
    This message was deleted.
  • early-kitchen-6639 (04/26/2023, 11:23 AM)
    Hello, I am trying to deploy DataHub on EKS. I was able to set up Elasticsearch and MySQL properly, and I also have topics created in Kafka. While deploying `datahub-gms`, I see that it connects to all the endpoints properly, but it is unable to resolve the Kafka broker hostnames. Our Kafka runs on EKS using the Strimzi operator, and I am providing the bootstrap server URL to datahub-gms. All other pods in our EKS cluster resolve the broker hostnames correctly, so it seems the issue is with the datahub-gms image. In fact, we use the same endpoint for Pinot tables as well. Has anyone faced this issue? Sharing the logs in the thread. Please help. Thanks!
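    To separate a DNS problem from a Kafka listener problem, it helps to resolve the broker hostname from inside the gms container itself; getent is a safer bet than nslookup, since the image may not ship DNS utilities. Pod, service, and namespace names below are placeholders:

        kubectl exec -it <datahub-gms-pod> -- getent hosts my-cluster-kafka-bootstrap.kafka.svc.cluster.local
        # If resolution works, the brokers may be advertising unreachable listener addresses;
        # compare with what the Strimzi Kafka resource reports:
        kubectl get kafka my-cluster -n kafka -o jsonpath='{.status.listeners}'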
  • many-rocket-80549 (04/26/2023, 2:51 PM)
    Hi, I am trying to deploy DataHub on an Ubuntu 22.04 server, but I am getting errors after running `datahub docker quickstart`. I am attaching a picture of the error and the log file. Can you give me a hand? Thanks!
    tmp34v7s_ti.log
  • prehistoric-wall-71780 (04/26/2023, 9:16 PM)
    Hello. I tested deploying via Helm on GKE and via the DataHub CLI/Docker. In both cases I had the same error when creating a connection with BigQuery:
        datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type bigquery: 'str' object is not callable
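    This PipelineInitError is usually a plugin-loading failure rather than a BigQuery problem; the common fix is making sure the BigQuery plugin is installed in the same Python environment and at the same version as the CLI. A sketch, with the version as a placeholder:

        datahub version                                        # note the CLI version
        pip install 'acryl-datahub[bigquery]==<cli-version>'   # match the CLI version
        datahub check plugins | grep bigquery                  # confirm the source plugin now loads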
  • bumpy-activity-74405 (04/27/2023, 7:51 AM)
    Hey, can someone help me understand why `datahub-upgrade` is needed to run datahub-gms? Some background: I have been running DataHub on Kubernetes for ~2 years now, without the provided helm charts. I simply have two pods, one for gms and one for the frontend; ES and MySQL are not on k8s. For the longest time that's all I needed: both frontend and gms recovered after being restarted. But as of `v0.10.1` (I think), gms just won't start if I don't run the upgrade container:
    2023-04-27 06:41:59,065 [R2 Nio Event Loop-1-2] WARN  c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
    I think it just gets stuck at this step:
    2023-04-27 06:41:39,668 [main] INFO  c.l.metadata.boot.BootstrapManager:33 - Executing bootstrap step 1/13 with name WaitForSystemUpdateStep...
    I understand the need to reindex ES indices when a certain upgrade requires it (which I've done ad hoc when upgrading `0.9.6.1` -> `0.10.1`), but what's the point of it outside of that? Is there any way to avoid having to run the upgrade each time gms starts?
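    For context: since the 0.10.x line, gms's first bootstrap step (the WaitForSystemUpdateStep in the log above) blocks until a datahub-upgrade run with the SystemUpdate task has recorded completion for the deployed version, so one upgrade run per version bump is expected, rather than one per gms restart. A sketch of a one-off run, assuming the container is handed the same storage/Kafka environment as gms (elided here into an env file):

        docker run --env-file gms.env acryldata/datahub-upgrade:v0.10.1 -u SystemUpdate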
  • wonderful-wall-76801 (04/27/2023, 10:33 AM)
    Hello everyone! Can anyone tell me what this line in the GMS logs means:
        2023-04-27 10:24:44,740 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 6 Took time ms: -1
    and why I see it about five times per second? I suspect it points to some delay (or something like that) affecting other tasks, such as creating a term in the glossary or changing permissions in settings, because when I try those steps I run into the following problem:
    2023-04-27 08:08:04,881 [I/O dispatcher 1] ERROR c.l.m.s.e.update.BulkListener:44 - Failed to feed bulk request. Number of events: 7 Took time ms: -1 Message: failure in bulk execution:
    [0]: index [glossarytermindex_v2_1682499154226], type [_doc], id [urn%3Ali%3AglossaryTerm%3A15618cc3-f96d-4f77-94d7-b6adb9e02ba8], message [[glossarytermindex_v2_1682499154226/vbvG5KPmSpiGQawAxCxalg][[glossarytermindex_v2_1682499154226][0]] ElasticsearchException[Elasticsearch exception [type=document_missing_exception, reason=[_doc][urn%3Ali%3AglossaryTerm%3A15618cc3-f96d-4f77-94d7-b6adb9e02ba8]: document missing]]]
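    The document_missing_exception means the search index has drifted out of sync with the primary store (the glossary term exists in SQL but not in that index). The standard remediation is rebuilding the indices from the database with the upgrade image's RestoreIndices task; as above, it needs the same environment variables as gms:

        docker run --env-file gms.env acryldata/datahub-upgrade:v0.10.1 -u RestoreIndices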
  • great-branch-515 (04/27/2023, 3:01 PM)
    Can anyone please tell me the health check URLs for the gms and frontend services?
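    For reference, the commonly used endpoints are /health on the gms port and /admin on the frontend port:

        curl http://localhost:8080/health   # gms
        curl http://localhost:9002/admin    # frontend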
  • wonderful-book-58712 (04/27/2023, 3:54 PM)
    Hi team, secrets are getting deleted from our DataHub instance each time we do an upgrade. Is there any workaround to fix this issue?
  • astonishing-byte-5433 (04/28/2023, 12:48 PM)
    Hey all, we are currently trying to deploy DataHub in an on-prem Kubernetes environment and are unsure about a few points:
    (1) External MySQL sizing: any suggestions for how much CPU/RAM/storage to expect? We have one DataHub instance with 4k tables and DBeaver shows only 16 MB consumed, but we are not sure about CPU and RAM. The user base would be just a few people.
    (2) Index & graph database: is there any highly recommended metadata we would lose if we don't host Elasticsearch as an external service and instead restore it from the MySQL data?
    (3) MSSQL ODBC connection: currently we install PyODBC plus the driver through a Kubernetes postStart lifecycle hook, running the pod as root. Any suggestion for a more elegant way, without building a custom actions container or a manual shell login and install? Thanks!
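    On (3): a root postStart hook re-installs packages on every restart, so the sturdier alternative is baking the same steps into a thin image derived from the actions image, even though that is the custom container the poster hoped to avoid. A sketch of the install steps either approach would run, assuming a Debian 11 base; adjust the Microsoft repo path to the image's actual OS release:

        apt-get update && apt-get install -y curl gnupg unixodbc-dev
        curl -fsSL https://packages.microsoft.com/keys/microsoft.asc \
          | gpg --dearmor -o /usr/share/keyrings/microsoft.gpg
        echo "deb [signed-by=/usr/share/keyrings/microsoft.gpg] https://packages.microsoft.com/debian/11/prod bullseye main" \
          > /etc/apt/sources.list.d/mssql-release.list
        apt-get update && ACCEPT_EULA=Y apt-get install -y msodbcsql18
        pip install pyodbc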
  • limited-forest-73733 (04/30/2023, 4:46 PM)
    Hey team, I want to ask about removing the Confluent schema-registry. Is there any plan for that? Thanks
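    Recent DataHub versions can already run without Confluent's registry by letting gms serve a schema registry internally. A sketch of the values involved; the type/url keys and the gms service name are assumptions here and should be verified against your chart version:

        cat <<'EOF' > schema-registry-values.yaml
        global:
          kafka:
            schemaregistry:
              type: INTERNAL
              url: "http://datahub-datahub-gms:8080/schema-registry/api/"  # gms service name assumed
        EOF
        helm upgrade datahub datahub/datahub --values schema-registry-values.yaml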
  • limited-forest-73733 (04/30/2023, 4:46 PM)
    Hey team, I have a question: can we attach dbt and Airflow metadata to domains? Thanks
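    Domains can be attached at ingestion time with a transformer in the recipe, and that works for dbt like any other source. A sketch with placeholder paths and a placeholder domain URN:

        cat <<'EOF' > dbt-recipe.yaml
        source:
          type: dbt
          config:
            manifest_path: ./target/manifest.json    # placeholder paths
            catalog_path: ./target/catalog.json
            target_platform: snowflake
        transformers:
          - type: simple_add_dataset_domain          # attaches a domain to each dataset
            config:
              domains:
                - "urn:li:domain:engineering"        # placeholder domain URN
        sink:
          type: datahub-rest
          config:
            server: http://localhost:8080
        EOF
        datahub ingest -c dbt-recipe.yaml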
  • best-daybreak-64419 (05/02/2023, 2:45 AM)
    Hi Team! I have a question about deploying the Kubernetes Helm charts for DataHub. According to the deployment guide, we deploy Helm in two stages: first datahub-prerequisites to install the external dependencies, and then the service itself. Since I use AWS's RDS, ES, and MSK, I don't use the es, rds, or kafka components distributed in the prerequisites. I can proceed with the deployment by setting enabled to false, but I wonder whether I need to do the prerequisites step at all if I don't use any of the external dependency services in the prerequisites values.
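    If every dependency is external, the prerequisites release can be skipped entirely, or installed with all components disabled. A sketch of the latter; the subchart keys mirror the prerequisites values file and should be verified against your copy:

        cat <<'EOF' > prerequisites-values.yaml
        elasticsearch:
          enabled: false
        mysql:
          enabled: false
        kafka:
          enabled: false
        neo4j:
          enabled: false
        cp-helm-charts:
          enabled: false
        EOF
        helm install prerequisites datahub/datahub-prerequisites --values prerequisites-values.yaml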