numerous-byte-87938
06/21/2023, 6:27 PM
DataHubUpgradeKafkaListener is using only one consumer group, and I am wondering what happens if several GMS instances bootstrap at the same time. From this community thread, it looks like a GMS instance had trouble getting assigned a Kafka partition and, given the default backoff wait time limit, it hangs for a long time before realizing “Oh, the partition is not assigned at all, maybe I should restart and force a rebalance”.
I do see some scenarios where a rebalance is kicked off during the wait; often, however, I see it hang there and have trouble getting out. So I’m curious:
1. What’s the benefit of having one consumer group for DataHubUpgradeKafkaListener?
2. Is there a way to help with the “partitions assigned: []” issue?
delightful-ram-75848
06/22/2023, 4:51 AM
creamy-van-28626
06/22/2023, 5:44 AM
numerous-byte-87938
06/22/2023, 9:06 PM
• GMS logs
2023-06-22 20:45:23,690 [main] INFO c.l.metadata.boot.BootstrapManager:26 - Starting Bootstrap Process...
2023-06-22 20:45:23,690 [main] INFO c.l.metadata.boot.BootstrapManager:33 - Executing bootstrap step 1/13 with name WaitForSystemUpdateStep...
...
2023-06-22 20:45:25,083 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.AbstractCoordinator:503 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Successfully joined group with generation 4225
2023-06-22 20:45:25,085 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:273 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions:
• MAE consumer logs
2023-06-22 20:45:25,067 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Attempt to heartbeat failed since group is rebalancing
2023-06-22 20:45:25,070 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Revoke previously assigned partitions datahub.upgrade_history-0
2023-06-22 20:45:25,070 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] (Re-)joining group
2023-06-22 20:45:25,082 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Successfully joined group with generation 4225
2023-06-22 20:45:25,082 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions: datahub.upgrade_history-0
• Timeline
1. [2023-06-22 20:45:23,690] A GMS pod tried to bootstrap.
2. [2023-06-22 20:45:25,067] The MAE consumer (which was holding the partition) noticed that the consumer group was rebalancing.
3. [2023-06-22 20:45:25,070] The MAE consumer revoked its previously assigned (and only) partition and rejoined the group (so far so good).
4. [2023-06-22 20:45:25,082] The same MAE consumer was assigned the only partition again.
5. [2023-06-22 20:45:25,085] The bootstrapping GMS pod didn’t get a partition assigned. It therefore hangs there forever and ultimately times out the deployment.
Two questions remain unclear to me:
• Why does MAE need to periodically poll the “datahub.upgrade_history” topic?
• How can we get rid of this rebalance issue?
aloof-gpu-11378
06/26/2023, 4:36 PM
“One interesting finding is that, in our case, the partition was held by a standalone MAE consumer while a GMS instance tried to bootstrap”
What version of the helm charts are you using?
https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/datahub-gms/templates/deployment.yaml#L114
https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/datahub-mae-consumer/templates/deployment.yaml#L87
We split the consumer group for MAE and GMS to avoid this.
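To illustrate why a shared group starves the bootstrapping GMS instance and why splitting the group helps, here is a minimal Python sketch. It is a toy model, not Kafka's actual assignor: the assign function, the member names, and the two split group ids (duhe-group-mae, duhe-group-gms) are hypothetical. The only behavior it encodes is the standard consumer-group rule that each partition has exactly one owner within a group, while separate groups each receive their own assignment of the same partition.

```python
def assign(partitions, members_by_group):
    """Toy assignor (NOT Kafka's real one): within each group, deal the
    partitions round-robin to that group's members. Every group gets its own
    independent assignment of all partitions."""
    result = {}
    for group, members in members_by_group.items():
        owned = {member: [] for member in members}
        for i, partition in enumerate(partitions):
            owned[members[i % len(members)]].append(partition)
        for member, parts in owned.items():
            result[(group, member)] = parts
    return result


# The topic in the logs has a single partition.
PARTITIONS = ["datahub.upgrade_history-0"]

# Shared group: the list order stands in for the coordinator's member order;
# the MAE consumer is placed first to mirror the logs in this thread.
shared = assign(
    PARTITIONS,
    {"generic-duhe-consumer-job-client": ["mae-consumer", "gms"]},
)

# Split groups (hypothetical group ids): each group has one member.
split = assign(
    PARTITIONS,
    {"duhe-group-mae": ["mae-consumer"], "duhe-group-gms": ["gms"]},
)

for label, assignment in (("shared", shared), ("split", split)):
    for (group, member), parts in assignment.items():
        print(f"{label}: group={group} member={member} partitions={parts}")
```

In the shared-group case, gms ends up with an empty assignment, which corresponds to the empty “Adding newly assigned partitions:” line in the GMS log. With split groups, both consumers own datahub.upgrade_history-0 in their respective groups, so neither can starve the other.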
numerous-byte-87938
06/26/2023, 6:38 PM
creamy-van-28626
07/06/2023, 8:41 AM
numerous-byte-87938
07/06/2023, 6:42 PM
DATAHUB_UPGRADE_HISTORY_KAFKA_CONSUMER_GROUP_ID solves the issue.
creamy-van-28626
07/07/2023, 9:03 AM
creamy-van-28626
07/07/2023, 9:03 AM