# troubleshooting
r
I think you are asking this because it's taking too long for the coordinator to redistribute the data, right? The coordinator runs its cycle every druid.coordinator.period (default is 1 minute). If you want to reduce downtime (but first, check why this query is bringing down the historical nodes!), having replicas > 2 can help: other nodes will still have the same segments even during historical failures.
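for reference, replication is configured per datasource with a retention rule rather than a server property; a minimal sketch (the tier name and replica count here are just illustrative) would be:
```
[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]
```
with at least 2 replicas per tier, losing one historical still leaves a queryable copy of each segment on another node.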
j
yes! I still want to know why the coordinator can't stop redistributing once the historicals recover; that doesn't sound very reasonable, and it costs too much time. 🫠 Now I know I can reduce druid.coordinator.period so that when a historical exits, the coordinator will reassign more often, but that also puts a lot of pressure on the coordinator.
r
hmm, after the nodes came back online, the coordinator will continue to redistribute, issuing drop commands to the remaining nodes that were still loading segments. you can actually increase the time before druid tries to 'heal' from a node failure, instead
you could try to increase
```
druid.coordinator.balancer.strategy.minItemsToMove
```
but since, as you said, 18 nodes crashed at the same time, I guess this would trigger a rebalance nevertheless
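for concreteness, a sketch of how that would look in the coordinator's runtime properties (the value is made up; tune it to your segment counts):
```
# coordinator runtime.properties -- as suggested above; illustrative value
druid.coordinator.balancer.strategy.minItemsToMove=100
```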
j
Sorry, I didn't fully understand. You think the issue occurs in the drop/redistribute commands? Do you think this pipeline needs some optimization? You think minItemsToMove will speed up the segment balancer? Because many historicals crashed, the Coordinator got stuck assigning segments, so I don't want to put more pressure on the Coordinator.
r
I don't think the coordinator is stuck, unless you have, IDK, 100 million segments?
if you increase minItemsToMove, it would make the coordinator wait longer for the crashed nodes to come back before considering them dead, so it would not redistribute their segments
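(side note: separately from the balancer strategy property above, the documented way to throttle how much balancing and replication happens per cycle is the coordinator *dynamic* config, which is POSTed as JSON to the coordinator; values here are illustrative:
```
{
  "maxSegmentsToMove": 5,
  "replicationThrottleLimit": 10
}
```
you POST it to http://<coordinator-host>:8081/druid/coordinator/v1/config, so it can be adjusted without restarting anything)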
k
Is it possible you are still running the deprecated `select` queries? This query type was known to cause this issue on historical nodes. https://druid.apache.org/docs/latest/querying/select-query.html
j
Our cluster has 1.1 million segments, and these historical restarts involved 0.5 million segments. Increasing minItemsToMove would lead historicals to move more segments at a time, wouldn't it? Why would it make the coordinator wait longer?
now we use Druid 0.17.1; it uses scan in place of select
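for anyone following along, a minimal scan query sketch (datasource, interval, and column names are placeholders):
```
{
  "queryType": "scan",
  "dataSource": "my_datasource",
  "intervals": ["2020-01-01/2020-01-02"],
  "columns": ["__time", "page", "count"],
  "resultFormat": "compactedList",
  "limit": 100
}
```
unlike the old select query, scan streams results instead of materializing the whole result set in memory on the historicals, which is why it replaced select.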