# all-things-deployment
n
Hi folks, I notice `DataHubUpgradeKafkaListener` uses only a single consumer group, and I'm wondering what happens when several GMS instances bootstrap at the same time. From this community thread, it looks like a GMS instance had trouble getting a Kafka partition assigned, and because of the default backoff wait limit it hangs for a long time before realizing "Oh, the partition is not assigned at all, maybe I should restart and force a rebalance." I do see some scenarios where a rebalance is kicked off during the wait; oftentimes, though, I see it hanging there and having trouble getting out. So I'm curious:
1. What's the benefit of having a single consumer group for `DataHubUpgradeKafkaListener`?
2. Is there a way to fix the `partitions assigned: []` issue?
d
@orange-night-91387 might be able to speak to this!
c
I have faced a similar kind of issue: when GMS restarted, it was unable to rejoin the consumer group at bootstrap time. We restarted several times, but it could not rejoin, and after a while it rejoined automatically. Why does this happen?
n
One interesting finding: in our case, the partition was held by a standalone MAE consumer while a GMS instance was trying to bootstrap. For example:
• GMS logs
```
2023-06-22 20:45:23,690 [main] INFO  c.l.metadata.boot.BootstrapManager:26 - Starting Bootstrap Process...
2023-06-22 20:45:23,690 [main] INFO  c.l.metadata.boot.BootstrapManager:33 - Executing bootstrap step 1/13 with name WaitForSystemUpdateStep...
...
2023-06-22 20:45:25,083 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator:503 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Successfully joined group with generation 4225
2023-06-22 20:45:25,085 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator:273 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions:
```
• MAE consumer logs
```
2023-06-22 20:45:25,067 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Attempt to heartbeat failed since group is rebalancing
2023-06-22 20:45:25,070 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Revoke previously assigned partitions datahub.upgrade_history-0
2023-06-22 20:45:25,070 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] (Re-)joining group
2023-06-22 20:45:25,082 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Successfully joined group with generation 4225
2023-06-22 20:45:25,082 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions: datahub.upgrade_history-0
```
• Timeline
1. [2023-06-22 20:45:23,690] A GMS pod tried to bootstrap.
2. [2023-06-22 20:45:25,067] The MAE consumer (which was holding the partition) noticed the consumer group was rebalancing.
3. [2023-06-22 20:45:25,070] The MAE consumer revoked the previously assigned (and only) partition and rejoined the group. (So far so good.)
4. [2023-06-22 20:45:25,082] The same MAE consumer got the only partition assigned to it again.
5. [2023-06-22 20:45:25,085] The bootstrapping GMS pod did not get a partition assigned, so it hangs there forever and ultimately times out the deployment.
Two questions remain unclear to me:
• Why does the MAE consumer need to periodically poll the `datahub.upgrade_history` topic?
• How can we get rid of this rebalance issue?
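To make the shared-group mechanics concrete, here is a minimal sketch (plain Kafka Java client, not DataHub code; it assumes a broker at `localhost:9092` and that `datahub.upgrade_history` exists with a single partition). Two consumers in one group can never both own the partition, so whichever one loses the rebalance ends up with the empty assignment the GMS pod was stuck on:

```java
// Sketch only (not DataHub code): two consumers sharing one consumer group on a
// single-partition topic. Assumes a broker at localhost:9092 and that
// datahub.upgrade_history already exists with exactly one partition.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SharedGroupDemo {

    static KafkaConsumer<String, String> newConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumption: local broker
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        String topic = "datahub.upgrade_history";               // one partition
        String group = "generic-duhe-consumer-job-client";      // shared group, as in the logs

        KafkaConsumer<String, String> mae = newConsumer(group); // stands in for the MAE consumer
        KafkaConsumer<String, String> gms = newConsumer(group); // stands in for the bootstrapping GMS
        mae.subscribe(List.of(topic));
        gms.subscribe(List.of(topic));

        // Poll both a few times so the group join/rebalance can settle.
        for (int i = 0; i < 5; i++) {
            mae.poll(Duration.ofSeconds(1));
            gms.poll(Duration.ofSeconds(1));
        }

        // One member owns datahub.upgrade_history-0; the other prints an empty set,
        // i.e. the "partitions assigned: []" state described above.
        System.out.println("MAE assignment: " + mae.assignment());
        System.out.println("GMS assignment: " + gms.assignment());

        mae.close();
        gms.close();
    }
}
```

Which member wins the partition is timing-dependent; the loser just sits with an empty assignment until the next rebalance, which matches the hang in the timeline above.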
a
> One interesting finding: in our case, the partition was held by a standalone MAE consumer while a GMS instance was trying to bootstrap
What version of the helm charts are you using?
https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/datahub-gms/templates/deployment.yaml#L114
https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/datahub-mae-consumer/templates/deployment.yaml#L87
We split the consumer group for MAE and GMS to avoid this.
n
That’s exactly it, thanks a ton!
c
In which version of the chart is this fixed?
n
We are not using Helm charts for our deployment, but separating the consumer group using the env var `DATAHUB_UPGRADE_HISTORY_KAFKA_CONSUMER_GROUP_ID` solves the issue.
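For illustration, a minimal sketch of what the split looks like from the consumer side (plain Kafka Java client, not DataHub's actual listener; the fallback group id below is just the shared default seen in the logs earlier). When each component resolves a different value for that variable, each one forms its own consumer group, so each is independently assigned `datahub.upgrade_history-0` and no longer competes during a rebalance:

```java
// Sketch only: a consumer that derives its group id from the environment variable
// mentioned above. The fallback is the shared default seen in the logs; giving GMS
// and the MAE consumer different values puts them in separate groups, so each group
// gets datahub.upgrade_history-0 independently.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class UpgradeHistoryConsumer {
    public static void main(String[] args) {
        String groupId = System.getenv().getOrDefault(
                "DATAHUB_UPGRADE_HISTORY_KAFKA_CONSUMER_GROUP_ID",
                "generic-duhe-consumer-job-client");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("datahub.upgrade_history"));
            consumer.poll(Duration.ofSeconds(5));           // joins its own group
            // With a per-component group id this is never empty, since no other
            // component competes for the single partition within the same group.
            System.out.println("partitions assigned: " + consumer.assignment());
        }
    }
}
```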
c
What's the purpose of this upgrade topic? Why is it used by three consumer groups?