On larger clusters, the coordinator and overlord are usually deployed as separate processes. Running more than one of them helps with high availability, but only one instance is ever the leader at any given time.
A couple of ideas on what to check:
• Do you have enough capacity to run 121 tasks concurrently? If there are not enough worker slots, the controller process will time out and fail. You can check total vs. used slots through the Overlord API; see the first sketch after this list.
• Is this on Kubernetes? If the coordinator/overlord is failing and there are no errors in the logs, perhaps it is exceeding its memory/CPU limit and getting OOM-killed by Kubernetes. Are the JVM settings in line with the pod resources? (The second sketch below shows a rough sizing check.)
• How busy was the coordinator node in terms of CPU and memory when it failed? Does it have enough resources?
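Here is a minimal sketch of the capacity check. It assumes the Overlord (or Router) is reachable at the `OVERLORD_URL` shown and uses the standard `GET /druid/indexer/v1/workers` endpoint; the `capacity` and `currCapacityUsed` field names are what I recall from that response, so double-check them against your Druid version.

```python
# Rough capacity check: do the workers have enough free task slots
# for the ingestion you are submitting?
import requests

# Assumption: Overlord (or Router) address; adjust for your deployment.
OVERLORD_URL = "http://localhost:8090"

def worker_slot_summary(tasks_needed: int) -> None:
    # Overlord endpoint listing workers and their slot usage.
    resp = requests.get(f"{OVERLORD_URL}/druid/indexer/v1/workers", timeout=10)
    resp.raise_for_status()
    workers = resp.json()

    total = sum(w["worker"]["capacity"] for w in workers)
    used = sum(w["currCapacityUsed"] for w in workers)
    free = total - used

    print(f"workers={len(workers)} total_slots={total} used={used} free={free}")
    if free < tasks_needed:
        print(f"Not enough free slots for {tasks_needed} tasks; "
              "the controller will wait and may eventually time out.")

if __name__ == "__main__":
    worker_slot_summary(tasks_needed=121)
```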
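And a back-of-the-envelope way to sanity-check the JVM settings against the pod's memory limit. The numbers below are purely illustrative assumptions, not recommendations; the point is that heap + direct memory + JVM/OS overhead has to fit under the Kubernetes limit, otherwise the container gets OOM-killed without any error showing up in the Druid logs.

```python
# Back-of-the-envelope check: does heap + direct memory + overhead fit
# inside the Kubernetes memory limit for the coordinator/overlord pod?

GiB = 1024 ** 3

# Illustrative assumptions; substitute your own jvm.config and pod spec values.
heap_xmx = 4 * GiB                  # -Xmx
direct_memory = 1 * GiB             # -XX:MaxDirectMemorySize
jvm_overhead = int(0.75 * GiB)      # metaspace, threads, GC structures (rough guess)
pod_memory_limit = 6 * GiB          # resources.limits.memory in the pod spec

jvm_footprint = heap_xmx + direct_memory + jvm_overhead
headroom = pod_memory_limit - jvm_footprint

print(f"estimated JVM footprint: {jvm_footprint / GiB:.2f} GiB")
print(f"pod memory limit:        {pod_memory_limit / GiB:.2f} GiB")
print(f"headroom:                {headroom / GiB:.2f} GiB")

if headroom < 0:
    print("JVM can exceed the pod limit -> likely OOM kill with no Druid-side error.")
```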