# all-things-deployment
n
Hi folks, I notice `DataHubUpgradeKafkaListener` uses only a single consumer group, and I'm wondering what happens when several GMS instances bootstrap at the same time. From this community thread, it looks like a GMS instance had trouble getting a Kafka partition assigned, and because of the default backoff wait limit it hangs for a long time before realizing "Oh, the partition is not assigned at all, maybe I should restart and force a rebalance." I do see some scenarios where a rebalance is kicked off during the wait; oftentimes, though, I see it hanging there and having trouble getting out. So I'm curious:
1. What's the benefit of having a single consumer group for `DataHubUpgradeKafkaListener`?
2. Is there a way to fix the `partitions assigned: []` issue?
d
@orange-night-91387 might be able to speak to this!
c
I have faced a similar kind of issue: when GMS restarted, it was unable to rejoin the consumer group at bootstrap time. We restarted several times, but it could not rejoin, and after a while it rejoined automatically. Why does this happen?
n
One interesting finding: in our case, the partition was held by a standalone MAE consumer while a GMS instance was trying to bootstrap. For example:
• GMS logs
```
2023-06-22 20:45:23,690 [main] INFO  c.l.metadata.boot.BootstrapManager:26 - Starting Bootstrap Process...
2023-06-22 20:45:23,690 [main] INFO  c.l.metadata.boot.BootstrapManager:33 - Executing bootstrap step 1/13 with name WaitForSystemUpdateStep...
...
2023-06-22 20:45:25,083 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator:503 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Successfully joined group with generation 4225
2023-06-22 20:45:25,085 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator:273 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions:
```
• MAE consumer logs
```
2023-06-22 20:45:25,067 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Attempt to heartbeat failed since group is rebalancing
2023-06-22 20:45:25,070 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Revoke previously assigned partitions datahub.upgrade_history-0
2023-06-22 20:45:25,070 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] (Re-)joining group
2023-06-22 20:45:25,082 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Successfully joined group with generation 4225
2023-06-22 20:45:25,082 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions: datahub.upgrade_history-0
```
• Timeline
1. [2023-06-22 20:45:23,690] A GMS pod tried to bootstrap.
2. [2023-06-22 20:45:25,067] The MAE consumer (which was holding the partition) noticed the consumer group was rebalancing.
3. [2023-06-22 20:45:25,070] The MAE consumer revoked the previously assigned (and only) partition and rejoined the group. (So far so good.)
4. [2023-06-22 20:45:25,082] The same MAE consumer got the only partition assigned to it again.
5. [2023-06-22 20:45:25,085] The bootstrapping GMS pod did not get a partition assigned, so it hangs there forever and ultimately times out the deployment.
Two questions remain unclear to me:
• Why does the MAE consumer need to periodically poll the `datahub.upgrade_history` topic?
• How can we get rid of this rebalance issue?
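To make the shared-group mechanics concrete, here is a minimal sketch (plain Kafka Java client, not DataHub code; it assumes a broker at `localhost:9092` and that `datahub.upgrade_history` exists with a single partition). Two consumers in one group can never both own the partition, so whichever one loses the rebalance ends up with the empty assignment the GMS pod was stuck on:

```java
// Sketch only (not DataHub code): two consumers sharing one consumer group on a
// single-partition topic. Assumes a broker at localhost:9092 and that
// datahub.upgrade_history already exists with exactly one partition.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SharedGroupDemo {

    static KafkaConsumer<String, String> newConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumption: local broker
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        String topic = "datahub.upgrade_history";               // one partition
        String group = "generic-duhe-consumer-job-client";      // shared group, as in the logs

        KafkaConsumer<String, String> mae = newConsumer(group); // stands in for the MAE consumer
        KafkaConsumer<String, String> gms = newConsumer(group); // stands in for the bootstrapping GMS
        mae.subscribe(List.of(topic));
        gms.subscribe(List.of(topic));

        // Poll both a few times so the group join/rebalance can settle.
        for (int i = 0; i < 5; i++) {
            mae.poll(Duration.ofSeconds(1));
            gms.poll(Duration.ofSeconds(1));
        }

        // One member owns datahub.upgrade_history-0; the other prints an empty set,
        // i.e. the "partitions assigned: []" state described above.
        System.out.println("MAE assignment: " + mae.assignment());
        System.out.println("GMS assignment: " + gms.assignment());

        mae.close();
        gms.close();
    }
}
```

Which member wins the partition is timing-dependent; the loser just sits with an empty assignment until the next rebalance, which matches the hang in the timeline above.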
a
> One interesting finding: in our case, the partition was held by a standalone MAE consumer while a GMS instance was trying to bootstrap
What version of the helm charts are you using?
https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/datahub-gms/templates/deployment.yaml#L114
https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/datahub-mae-consumer/templates/deployment.yaml#L87
We split the consumer group for MAE and GMS to avoid this.
n
That’s exactly it, thanks a ton!
c
In which version of the chart is this fixed?
n
We are not using Helm charts for our deployment, but separating the consumer group using the env var `DATAHUB_UPGRADE_HISTORY_KAFKA_CONSUMER_GROUP_ID` solves the issue.
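For illustration, a minimal sketch of what the split looks like from the consumer side (plain Kafka Java client, not DataHub's actual listener; the fallback group id below is just the shared default seen in the logs earlier). When each component resolves a different value for that variable, each one forms its own consumer group, so each is independently assigned `datahub.upgrade_history-0` and no longer competes during a rebalance:

```java
// Sketch only: a consumer that derives its group id from the environment variable
// mentioned above. The fallback is the shared default seen in the logs; giving GMS
// and the MAE consumer different values puts them in separate groups, so each group
// gets datahub.upgrade_history-0 independently.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class UpgradeHistoryConsumer {
    public static void main(String[] args) {
        String groupId = System.getenv().getOrDefault(
                "DATAHUB_UPGRADE_HISTORY_KAFKA_CONSUMER_GROUP_ID",
                "generic-duhe-consumer-job-client");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("datahub.upgrade_history"));
            consumer.poll(Duration.ofSeconds(5));           // joins its own group
            // With a per-component group id this is never empty, since no other
            // component competes for the single partition within the same group.
            System.out.println("partitions assigned: " + consumer.assignment());
        }
    }
}
```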
c
What's the purpose of this upgrade topic? Why is it used by three consumer groups?