# troubleshooting
r
I think you are asking this because it's taking too long for the coordinator to redistribute the data, right? The coordinator runs its cycle every druid.coordinator.period (default is 1 minute). If you want to reduce downtime (but first, check why this query is bringing down the historical nodes!), having replicas > 2 can help: other nodes will still have the same segments even during historical failures.
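for reference, replication is configured per datasource with a retention rule rather than a server property; a minimal sketch (the tier name and replica count here are just illustrative) would be:
```
[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]
```
with at least 2 replicas per tier, losing one historical still leaves a queryable copy of each segment on another node.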
j
yes! I still want to know why the coordinator can't stop redistributing once the historicals recover; that doesn't sound very reasonable, and it costs too much time. 🫠 Now I know I can reduce druid.coordinator.period so that when a historical exits, the coordinator will reassign more often, but that also puts a lot of pressure on the coordinator.
r
hmm, after the nodes came back online, the coordinator will continue to redistribute, issuing drop commands to the remaining nodes that were still loading segments. you can actually increase the time before druid tries to 'heal' from a node failure, instead
you could try to increase
```
druid.coordinator.balancer.strategy.minItemsToMove
```
but since, as you said, 18 nodes crashed at the same time, I guess this would trigger a rebalance nevertheless
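for concreteness, a sketch of how that would look in the coordinator's runtime properties (the value is made up; tune it to your segment counts):
```
# coordinator runtime.properties -- as suggested above; illustrative value
druid.coordinator.balancer.strategy.minItemsToMove=100
```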
j
Sorry, I didn't fully understand. You think the issue occurs in the drop/redistribute commands? Do you think this pipeline needs some optimization? You think minItemsToMove will speed up the segment balancer? Because many historicals crashed, the Coordinator got stuck assigning segments, so I don't want to put more pressure on the Coordinator.
r
I don't think the coordinator is stuck, unless you have, IDK, 100 million segments?
if you increase minItemsToMove, it would make the coordinator wait longer for the crashed nodes to come back before considering them dead, so it would not redistribute their segments
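(side note: separately from the balancer strategy property above, the documented way to throttle how much balancing and replication happens per cycle is the coordinator *dynamic* config, which is POSTed as JSON to the coordinator; values here are illustrative:
```
{
  "maxSegmentsToMove": 5,
  "replicationThrottleLimit": 10
}
```
you POST it to http://<coordinator-host>:8081/druid/coordinator/v1/config, so it can be adjusted without restarting anything)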
k
Is it possible you are still running the deprecated `select` queries? This query type was known to cause this issue on historical nodes. https://druid.apache.org/docs/latest/querying/select-query.html
j
Our cluster has 1.1 million segments, and these historical restarts involved 0.5 million segments. Increasing minItemsToMove would lead historicals to move more segments at a time, wouldn't it? Why would it make the coordinator wait longer?
now we use Druid 0.17.1; it uses scan in place of select
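for anyone following along, a minimal scan query sketch (datasource, interval, and column names are placeholders):
```
{
  "queryType": "scan",
  "dataSource": "my_datasource",
  "intervals": ["2020-01-01/2020-01-02"],
  "columns": ["__time", "page", "count"],
  "resultFormat": "compactedList",
  "limit": 100
}
```
unlike the old select query, scan streams results instead of materializing the whole result set in memory on the historicals, which is why it replaced select.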