# troubleshooting
s
On larger clusters, the coordinator and overlord are usually deployed separately. Adding more of them helps with high availability, but only one of each is ever the leader. A couple of ideas on what to check:
• Do you have enough capacity to run 121 tasks concurrently? If there are not enough worker slots, the controller process will time out and fail.
• Is this on Kubernetes? If the coordinator/overlord is failing and there are no errors, perhaps it is exceeding its memory/CPU limit and getting killed by Kubernetes. Are the JVM settings in line with the pod resources?
• How busy was the coordinator node in terms of CPU and memory when it failed? Does it have enough resources?
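In case it helps, here is a rough Python sketch of the first two checks. It assumes the worker list has the shape returned by the Overlord's `GET /druid/indexer/v1/workers` endpoint (a `worker.capacity` and a `currCapacityUsed` per entry), so please verify that against your version. The function names and the 20% headroom are my own placeholders, not anything Druid or Kubernetes prescribes.

```python
import re


def free_task_slots(workers):
    """Sum the unused task slots across all workers.

    `workers` is assumed to be the parsed JSON list from the Overlord's
    GET /druid/indexer/v1/workers endpoint.
    """
    return sum(w["worker"]["capacity"] - w["currCapacityUsed"] for w in workers)


# Binary multipliers for JVM size suffixes (k, m, g).
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def jvm_size_bytes(value):
    """Parse a JVM size such as '512m' or '4g' into bytes."""
    match = re.fullmatch(r"(\d+)([kmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized JVM size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit, 1)


def jvm_fits_pod(xmx, max_direct, pod_limit_bytes, headroom=0.2):
    """Check that heap plus direct memory leaves headroom under the pod limit.

    The 20% default headroom is an arbitrary assumption to cover
    metaspace, thread stacks, and other non-heap usage.
    """
    needed = jvm_size_bytes(xmx) + jvm_size_bytes(max_direct)
    return needed <= pod_limit_bytes * (1 - headroom)
```

For the job above, you would want `free_task_slots(workers) >= 121`, and `jvm_fits_pod(...)` to hold for the `-Xmx` and `-XX:MaxDirectMemorySize` values on the coordinator/overlord pod.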
a
What version are you on?
I agree with Sergio that your master nodes do seem underprovisioned for the number of workers you are running.
b
Hey @Sergio Ferragut @Abhishek Agarwal, thanks for helping. Using a bigger node for the master server fixed this for me. Btw, my Druid version is 25.0.0 and it's on Kubernetes.