# troubleshoot
a
Hello, I am trying to update datahub from 0.9.5 to 0.10.0. I ran the system upgrade job, and now GMS is giving me this error:
2023-03-09 09:29:44,122 [I/O dispatcher 1] INFO c.l.m.s.e.update.BulkListener:47 - Successfully fed bulk request. Number of events: 1 Took time ms: -1
2023-03-09 09:30:23,729 [R2 Nio Event Loop-1-1] WARN c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Thread.java:829)
Any ideas?
g
What kind of deployment is it? Docker-compose within a VM?
a
nope. kubernetes with helm
g
Weird. Did you just change the global version variable, or did you make any other changes to values?
a
hm.. no
i also tried with the dns name of the service
still the same
g
Did you update your chart as well?
a
i updated the env variable in datahub-gms deployment.yaml
i verified in the logs that it picked up the value
g
I'm not sure, but I think the latest version has changes not only to env variables; there's also another batch job called systemUpdate that is necessary to reindex your Elasticsearch indices.
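For reference, in a helm deployment that reindexing is handled by the system-update job the chart runs as a hook. A minimal sketch for checking it, assuming a namespace of datahub and the chart's default job naming (both illustrative):

```bash
# List the setup/upgrade jobs created by the chart (names vary with the release name)
kubectl get jobs -n datahub | grep -iE 'system-update|upgrade|setup'

# Inspect the system-update job's logs to confirm it completed
kubectl logs -n datahub job/<release-name>-datahub-system-update-job
```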
a
i did that
g
Anyway, this is weird, because it appears that GMS is trying to create a connection to itself. In my config, GMS uses port 8080.
a
it is also trying 8080
g
I tried to help, but I'm out of ideas about what it could be
a
ok thanks
a
Hi, this might have to do with a regression in a recent version we’re currently looking into/fixing - keep an eye on #announcements for more info
b
Hi @agreeable-belgium-70840! It seems you need to run the datahub-upgrade container. It will perform an upgrade on your system that will allow the rest of the system components to update!
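For anyone following along, a minimal sketch of running that container manually, with an illustrative env file name and tag (a concrete invocation appears later in this thread):

```bash
# Run the upgrade job with the SystemUpdate argument, pinned to the target release;
# docker.env must point at the same SQL/Kafka/Elasticsearch backends as GMS
docker run --rm --env-file docker.env acryldata/datahub-upgrade:v0.10.0 -u SystemUpdate
```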
a
@big-carpet-38439 i've done that already
actually I wiped out all the data
I am using the current helm chart from datahub
ran the init jobs
and I am still getting the same error
I'm starting to run out of ideas now
a
Hi Yianni- what helm charts are you using, and do you have any modifications in your deploy?
a
I am using the helm charts from here: https://github.com/acryldata/datahub-helm Nope, no modifications...
a
@astonishing-answer-96712 @agreeable-belgium-70840 - I’m actually facing the same issue
2023-03-15 17:33:10,411 [R2 Nio Event Loop-1-1] WARN  c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
This is what I did:
1 - Ran elasticsearch-setup
2 - Ran kafka-setup
3 - Ran datahub-upgrade -u SystemUpdate
4 - Error when trying to start datahub-gms
PS: I have a self-hosted ES and Kafka. Any ideas? Thanks!
g
Can you share the pod details? The GMS_HOST environment variable may be wrong.
a
@gentle-camera-33498 I’m trying to run GMS as a Docker container. I can’t see a variable called GMS_HOST.
g
The GMS_HOST is an environment variable used by other services to retrieve the host and port where the GMS server is hosted (generally in format <host>:<port>)
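As an illustration only (the speaker calls it GMS_HOST; many DataHub containers split it into separate host and port variables, so check your own deployment for the exact names):

```bash
# Hypothetical example of pointing a dependent service at GMS instead of localhost
export DATAHUB_GMS_HOST=datahub-gms   # service/DNS name where GMS is reachable
export DATAHUB_GMS_PORT=8080          # port GMS listens on
```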
a
Yes, but I’m getting the connection refused from GMS, not other services. That’s the weird part.
It seems like GMS is trying to reach out to itself before the startup process is complete.
g
Ok, so those are the logs from the GMS container, right? If so, it's really weird.
a
Yes.
g
Anyway, could you share the GMS container environment variables? Just to have a look.
a
Sure, I can share.
c
I am having the same issue in DataHub v0.10.0.
In my case, I was running an ingestion and the Kafka broker disk went to 100%. After increasing the disk, the GMS container keeps failing and restarting with this error.
a
Hi @careful-garden-46928, could you share your env config on GMS?
c
Sure @astonishing-answer-96712
ENABLE_PROMETHEUS: 'true',
                    DATAHUB_SERVER_TYPE: 'quickstart',
                    DATAHUB_TELEMETRY_ENABLED: 'true',
                    DATASET_ENABLE_SCSI: 'false',
                    EBEAN_DATASOURCE_HOST: `${rdsHost}`,
                    EBEAN_DATASOURCE_DRIVER: 'com.mysql.jdbc.Driver',
                    KAFKA_SCHEMAREGISTRY_URL: `http://${schemaRegistryHost}:8081`,
                    KAFKA_BOOTSTRAP_SERVER: `${kafkaBootstrapServer}`,
                    EBEAN_DATASOURCE_URL: `${ebeanDataSourceUrl}`,
                    ELASTICSEARCH_PORT: '443',
                    ELASTICSEARCH_USE_SSL: 'true',
                    GRAPH_SERVICE_IMPL: 'elasticsearch',
                    ENTITY_REGISTRY_CONFIG_PATH: '/datahub/datahub-gms/resources/entity-registry.yml',
                    MAE_CONSUMER_ENABLED: 'true',
                    MCE_CONSUMER_ENABLED: 'true',
                    PE_CONSUMER_ENABLED: 'true',
                    UI_INGESTION_ENABLED: 'true',
                    METADATA_SERVICE_AUTH_ENABLED: 'true',
I can share a little bit more regarding this issue:
1. I tried to clean up all the topics because I thought that too many pending events waiting to be ingested might be causing the issue. After cleaning all the events, the issue persisted.
2. Then I tried to delete the Kafka topics and recreate them from scratch.
3. I also deleted and recreated the Elasticsearch index (tearing down the AWS OpenSearch cluster and recreating it).
None of the previous steps worked.
We had this issue before, and we could only manage to restore the system by applying our Phoenix Protocol (burning everything to the ground, recreating everything and re-ingesting all the data). This happened before in our DEV environment and now it has happened to our INT environment. So far we have managed to restore the systems because we have all the setup automated in AWS, but this is not a procedure we like to run all the time. It would be good to understand what we are doing wrong to cause this instability.
a
@dazzling-yak-93039 might be able to provide some insight here
d
Could you try clearing the docker cache and re-running it?
docker system prune -a
Context: We had a bug recently where the GMS image and the Upgrade image were not the same version, so GMS was always waiting for the Upgrade job to finish, because the versions didn't match. Clearing the docker images should let you download the new images that don't have this issue.
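A quick, hedged way to check for that mismatch before (or instead of) pruning, assuming the quickstart's default container names and locally pulled images:

```bash
# Compare the tags of the GMS and upgrade images currently on the host
docker images | grep -E 'datahub-gms|datahub-upgrade'

# Check exactly which image each container was started from
docker inspect --format '{{.Config.Image}}' datahub-gms
docker inspect --format '{{.Config.Image}}' datahub-upgrade
```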
o
Note: this should only be the case for quickstart deployments; if you are specifying a released tag and not head, something else is going on. If this is an environment with production data then please note that with the v0.10.0 release a reindex will occur, and depending on the size of the data it can take several hours to resolve (this would be tens of millions of documents, a large-scale deployment). Deleting all your Kafka topics would definitely cause a problem, and you would need to re-run kafka-setup and the upgrade job. The upgrade -> GMS communication is done through a Kafka message which lets GMS know the upgrade is finished.
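For completeness, a hedged sketch of re-running kafka-setup outside the helm chart (image namespace, tag and broker address are illustrative); the upgrade job is then re-run as described above so the completion message is republished:

```bash
# Recreate the DataHub topics, including DataHubUpgradeHistory_v1
docker run --rm \
  -e KAFKA_BOOTSTRAP_SERVER=broker:29092 \
  linkedin/datahub-kafka-setup:v0.10.0
```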
c
@dazzling-yak-93039 I can’t do a docker system prune because we are running datahub in AWS ECS. I assume that when the containers restart, they could/would start on new hardware on the AWS side, with a clean volume and system.
@orange-night-91387 We did successfully execute the upgrade job and we double-checked the version, so both the upgrade container and datahub are running version 0.10.0.
I deleted the Kafka topics to check whether the tons of events in Kafka were causing some issue when the GMS container was reinitialising, and afterwards I executed kafka-setup so all topics were back. This procedure did not change anything, so we still got the same error:
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
Could it be that when the GMS container starts it tries to pick up the last state it was in before crashing, and is trying to continue processing the events that were pending during the crash? A reminder: in my case the system was working perfectly well after the migration. Once I started an S3 ingestion, the system crashed in the middle of the process due to Kafka disk usage. After fixing the Kafka storage, GMS started returning these errors. Note: I can also see that other containers are failing to connect to GMS.
Thank you for your time answering questions and helping me debug the issue 🙂
o
I did delete the kafka topics to check if the tons of events in kafka would be causing some issue when the gms container was reinitialising. And afterwards I did execute the kafka-setup so all topics were back.
Did you re-execute DataHub Upgrade though? Without doing this GMS would not start. Since you completely cleared the topic data, the message GMS would be looking for that got sent from the first Upgrade run would not be there. This is probably your issue since your upgrade was working prior and is a different root cause than what is probably happening to others in this thread.
c
@orange-night-91387 Indeed, I don’t remember executing the DataHub Upgrade after deleting the topics. I will check with my teammates and return here so others can benefit from the debugging.
So indeed we did not execute the upgrade after “cleaning” the Kafka topics. We will pay attention to this scenario; if it happens again we will collect more information and consider running the upgrade container. I will report back on whether it worked. Thanks to everybody who contributed to this discussion. 🤘
a
@careful-garden-46928 did it work? I am still having the same issue...
with v0.10.1, so I am guessing that I am doing something wrong. In the beginning I was thinking that it was a gms bug
c
@agreeable-belgium-70840 since it was in our integration environment and we had a deadline, we triggered our Phoenix Protocol and simply destroyed and recreated everything 🙂 I would try:
1. running the datahub setup containers (kafka, ES and the others) to be sure the expected structure is available in the persistence layer
2. running the datahub-upgrade container
3. running the restore indices process to be sure all data in mysql is indexed correctly in ES (see the sketch below)
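A minimal sketch of step 3, reusing the upgrade image with the argument from the "Restoring Search and Graph Indices from Local Database" doc referenced later in this thread (env file and tag are illustrative):

```bash
# Rebuild the Elasticsearch search/graph indices from the rows in the SQL database
docker run --rm --env-file docker.env acryldata/datahub-upgrade:v0.10.1 -u RestoreIndices
```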
a
actually i made 0.10.1 work, but i had to wipe out all the data
f
Hello guys, I'm facing exactly this issue. I've updated the CLI version to 0.10.1 and, after starting the datahub instance, I cannot log in. When I check datahub health, it says it cannot connect to datahub-gms (connection refused).
h
@fierce-monkey-46092 Hello! I have the same problem, if you find a solution please let me know
f
@agreeable-belgium-70840 hello sir, how did you get 0.10.1 to work? with quickstart or docker-compose yaml?
a
i am using kubernetes. but now I am facing another issue: it is using TLSv1.3 for the kafka connection, I can't change it via the env variables, and the connection to kafka is timing out
w
Hello everyone! I'm facing this issue too =( Anybody know how to fix it?
b
I'm facing this too today. Currently running datahub upgrade after doing a system prune. Using Elastic Cloud, Cloud SQL and then I have Kafka locally as part of the docker compose.
That seemed to fix it @wonderful-wall-76801
w
hmm, you mean docker system prune? i'm working with kubernetes and datahub helm chart
f
Hi all, I was also facing issues when upgrading from v0.9.6 to v0.10.2. Following the hint in the release notes to perform the command below usually failed due to connection issues with kafka, elastic etc.
docker run acryldata/datahub-upgrade:v0.10.0 -u SystemUpdate
I could help myself using the command found in Restoring Search and Graph Indices from Local Database. I simply adapted the command as below. Be aware you might need to change the image version in datahub-upgrade.sh and update the docker.env used accordingly.
./docker/datahub-upgrade/datahub-upgrade.sh -u SystemUpdate
Hope this helps
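In case it helps others, a hedged sketch of what such a docker.env might contain, echoing the variables quoted earlier in this thread (hosts, credentials and ports are placeholders; check the datahub-upgrade docs for the full list your version expects):

```bash
# SQL database that backs GMS
EBEAN_DATASOURCE_HOST=mysql:3306
EBEAN_DATASOURCE_URL=jdbc:mysql://mysql:3306/datahub
EBEAN_DATASOURCE_USERNAME=datahub
EBEAN_DATASOURCE_PASSWORD=datahub
EBEAN_DATASOURCE_DRIVER=com.mysql.jdbc.Driver

# Kafka and schema registry
KAFKA_BOOTSTRAP_SERVER=broker:29092
KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081

# Elasticsearch / OpenSearch
ELASTICSEARCH_HOST=elasticsearch
ELASTICSEARCH_PORT=9200
GRAPH_SERVICE_IMPL=elasticsearch
ENTITY_REGISTRY_CONFIG_PATH=/datahub/datahub-gms/resources/entity-registry.yml
```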
f
./docker/datahub-upgrade/datahub-upgrade.sh -u SystemUpdate
I followed the above command after changing my docker.env file. The script ran successfully the first time. After I log into the frontend, the GMS version is not upgraded. When I run the script again it gives me an error: Caused by: java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED
f
@fierce-monkey-46092 you need to update the containers first and then execute the command to perform some internal updates
f
@full-dentist-68591 I've searched the documentation and followed the steps, but I'm still not sure how to update the containers first
c
Had the issue again upgrading from v0.10.0 -> v0.10.2. I don’t understand why GMS is failing with the connection refused error 😕
2023-03-15 17:33:10,411 [R2 Nio Event Loop-1-1] WARN  c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
In my setup I have the containers running on AWS ECS. I suspect that when AWS deploys the new TaskDefinition with the new image version, since it runs the new task concurrently with the current old version, DataHub runs 2 GMS containers at some point, and I think that breaks something in the database. 😕 I am running some tests to check whether this is the case.
f
I've updated all the containers and ran datahub-upgrade.sh successfully. But I'm getting Connection refused: localhost/127.0.0.1:8080 on GMS. What is this, haha
a
@brainy-tent-14503 may be able to help here - seems like it’s a widespread issue
b
The new GMS will not start until the system-update job finishes, so it will simply wait. The required output from the system-update job looks like:
2023-04-17 10:42:24 Executing Step 4/5: DataHubStartupStep...
2023-04-17 10:42:24 2023-04-17 15:42:24.582  INFO 1 --- [           main] c.l.d.u.s.e.steps.DataHubStartupStep     : Initiating startup for version: v0.10.2-0
2023-04-17 10:42:24 Completed Step 4/5: DataHubStartupStep successfully.
When using quickstart this is all handled for you and there is no need to execute the datahub-upgrade.sh script. If you are managing docker manually, note that the referenced script does not necessarily align the version: it points to head, and you are likely intending to deploy a specific version such as v0.10.2.
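A quick way to check for that output, as a sketch with illustrative container and job names:

```bash
# Docker: look for the DataHubStartupStep lines in the upgrade container's output
docker logs datahub-upgrade 2>&1 | grep DataHubStartupStep

# Kubernetes: the same check against the system-update job created by the helm chart
kubectl logs job/<release-name>-datahub-system-update-job | grep DataHubStartupStep
```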
c
I want to contribute here with some information for future debugging purposes.
My setup:
• Containers running on AWS ECS Fargate
• AWS managed services for Kafka, MySQL and ElasticSearch
• Updating from v0.10.0 to v0.10.2 (tried to update to v0.10.1 before)
Error message while bringing up the v0.10.2 containers:
2023-03-15 17:33:10,411 [R2 Nio Event Loop-1-1] WARN  c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
...
After reading the code changes of release v0.10.2, I noticed some changes that seemed to indicate it was necessary to run the command
docker run --rm --env-file docker_upgrade.env acryldata/datahub-upgrade:v0.10.2 -u SystemUpdate
which was confirmed by @brainy-tent-14503's message in this thread. After successfully executing the SystemUpdate command, all errors disappeared and I was able to run v0.10.2.
Conclusion: it seems that it is always required to execute the SystemUpdate command before updating to any new version. It wasn’t clear to me and my team that the SystemUpdate command had to be executed between minor version updates. There is no information regarding this in the release notes, nor in the official online documentation. From version v0.9.X to v0.10.X there was a disclaimer informing that the SystemUpdate command was required. I would expect such a note in this recent release as well.
o
Datahub upgrade is a required component and will need to be run for all releases. In some releases it will be a no-op, but this is the intended setup for any migration executions needed on the backend side going forward. This was called out in the town hall presentation as well as updated in the helm charts and docker compose configuration. We can make this more clear in the documentation and release notes going forward as well.
s
@orange-night-91387, I'm having the same issues as above. Regarding your message linked below: https://datahubspace.slack.com/archives/C029A3M079U/p1679343681184609?thread_ts=1678354501.041149&cid=C029A3M079U Can you please indicate via which Kafka topic this update message is sent to GMS? I ran the -u SystemUpdate command "successfully", but I noticed this in the logs:
INFO - 2023-04-20 13:20:22.761 ERROR 1 --- [main] c.l.m.dao.producer.KafkaHealthChecker    : Failed to emit History Event for entity Event Version: v0.10.2-0
INFO - 
INFO - org.apache.kafka.common.errors.TimeoutException: Topic DataHubUpgradeHistory_v1 not present in metadata after 60000 ms.
INFO - 
INFO - 2023-04-20 13:20:22.762  INFO 1 --- [main] c.l.d.u.s.e.steps.DataHubStartupStep     : Initiating startup for version: v0.10.2-0
INFO - Completed Step 4/5: DataHubStartupStep successfully.
Does this mean that the SystemUpdate message was never posted to the DataHubUpgradeHistory_v1 topic, and therefore was never consumed by GMS?
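One hedged way to answer that is to read the topic directly with the standard Kafka console consumer (the bootstrap address is a placeholder; add your SASL/SSL client properties via --consumer.config if the cluster is secured):

```bash
# If the SystemUpdate run published its startup message, it should show up here
kafka-console-consumer.sh \
  --bootstrap-server broker:9092 \
  --topic DataHubUpgradeHistory_v1 \
  --from-beginning
```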
a
The process that creates the topic is called kafka-setup; this docker container creates the topics, including DataHubUpgradeHistory_v1. The helm chart performs these and other setup jobs before the system update. As far as I know there is no way to apply the chart against ECS, so there you will have to run all the containers manually in the order specified by the helm hooks, which also indicate when during an install/update to run and in what order. Examples: 1, 2. More information about this is available in the helm documentation.
s
So I have resolved this issue. Because we used customised helm charts, and managed MSK with some rules on creating topics, we were not able to use the kafka-setup container to create topics. I had to manually create the topics and add a few additional Kafka-related (SASL, JAAS) env vars to the upgrade container. Once these were added, the upgrade container was able to connect to the DataHubUpgradeHistory_v1 topic and GMS was able to consume the upgrade messages from it. No more localhost connection refused errors.
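For others hitting the same MSK topic-creation restriction, a rough sketch of creating the topic by hand with the standard Kafka CLI (partition/replication settings and the SASL client config file are illustrative); the Kafka env vars for the upgrade container are covered in the springKafkaConfigurationOverrides discussion a few messages below:

```bash
# Create the upgrade-history topic manually when kafka-setup cannot run against MSK
kafka-topics.sh --create \
  --bootstrap-server broker:9096 \
  --command-config client-sasl.properties \
  --topic DataHubUpgradeHistory_v1 \
  --partitions 1 --replication-factor 3
```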
Not sure why the SystemUpdate step passed if it could not post to the DataHubUpgradeHistory_v1 topic. Surely it should have failed?
e
What are the env vars to set for the upgrade container when SSL is enabled on the Kafka cluster? I'm facing the exact same issue when trying to connect to SSL-enabled Kafka in the SystemUpdate job
b
@early-kitchen-6639 The datahub-upgrade Kafka-related variables are pulled from the global configuration, see here. The full list of Spring Kafka parameters is documented in the Spring docs here, and some of them are discussed in the docs here. Those would be configured in the springKafkaConfigurationOverrides section of the helm values.
Please provide the logs from the datahub system-update job, which helm runs during the pre-install/upgrade step.
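A hedged sketch of what that section can look like for a SASL_SSL cluster: the section name is as referenced above, its placement under global follows the chart's values layout, and the property values are placeholders that map to standard Kafka client settings:

```bash
# Write a small values override for the chart; pass it to helm with an extra -f flag
cat > kafka-overrides.yaml <<'EOF'
global:
  springKafkaConfigurationOverrides:
    security.protocol: SASL_SSL
    sasl.mechanism: SCRAM-SHA-512
    sasl.jaas.config: 'org.apache.kafka.common.security.scram.ScramLoginModule required username="user" password="pass";'
EOF
```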
h
@brainy-tent-14503 PFA zipped logs of system update job
b
I am looking into the exception in the log.
Ok, this log is from the post-GMS-start job. There should be a pod whose name contains the string dh-system-update, whereas this looks like dh-nocode-migration logs.
If in fact the logs are from a pod with system-update, then something is being lost when executing the pod, likely the cli arguments here which run the system-update logic.
@helpful-dream-67192 ☝️ Let me know if you would be able to share the pod’s manifest and we can look for the right args.
h
@brainy-tent-14503 Thank you so much. System-update was disabled, hence it wasn't running; after enabling it, datahub is working properly. Two queries for you:
1. Do we always need to run system-update on every datahub helm upgrade?
2. Do we always need to run the setup jobs (elasticsearch, kafka, mysql) on every datahub helm upgrade?
a
1.) Yes, this is required. It will essentially handle reindexing and clean-up of previous backups/cloned indices. In the future it may also perform database migrations and other steps required to update the system prior to GMS starting the new version.
2.) For the setup jobs, they are typically only required for initial setup; however, it is possible that a new topic is added for Kafka or some other configuration needs to be applied, so I would run them to be sure. The helm chart will run the 3 setup jobs first and then the system-update step, per the helm chart hooks, in the correct order.
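In other words, with the stock chart a plain upgrade is usually enough, since the hooks run those jobs for you. A hedged sketch, using the repo and chart names from the datahub-helm README:

```bash
# The chart's pre-install/pre-upgrade hooks run the elasticsearch/kafka/mysql setup jobs
# and then the system-update job before GMS is rolled to the new version
helm repo update
helm upgrade --install datahub datahub/datahub --values values.yaml
```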
c
Hi @brainy-tent-14503, I finally got my helm charts to deploy successfully! There were a few things I had to do:
1. datahub-system-update-job.yml: we had issues previously with the system update job not being able to run with a helm pre-install hook (I had set it to a post-install hook), but I had to set it BACK to a "helm.sh/hook": pre-install,pre-upgrade.
2. When the system update job was trying to run, I was getting an error about various datahub secrets not being present. I then 1) scripted out the templated yaml for the secrets, 2) kubectl applied the secrets, 3) restarted the datahub-system-update-job.
3. This then allowed GMS to complete the deployment.
4. I originally tried adding a "helm.sh/hook": pre-install,pre-upgrade annotation to my secrets, but that does not seem to work in helm, hence step 2.2 above.
Thanks for all your help!
f
is there a particular order between system update/datahub upgrade and the various other prerequisites? we ran both of the upgrade-related pods for going from 0.9.6.1 to 0.10.4 and are still failing with the connection refused issue
b
Yes, there is an order to the containers which is controlled by helm using the hooks as mentioned earlier in the thread. To give you a whole picture, I’ve thrown together a quick diagram here. Work your way from the bottom to the top depending on your environment. Also feel free to start a new thread with your logs and we can continue the discussion there. @fierce-orange-10929
f
thanks @brainy-tent-14503!