# troubleshooting
hello, I've run into an issue where someone scaled up our Pinot servers, left the new nodes in the cluster for a while, and then simply removed the nodes again - the API to remove the instances was never invoked, and neither was table rebalancing. Some segments were assigned to the new nodes, and after those nodes were removed the tables went into a bad state. The current state is that queries are not reading the segments in bad state. The table external view only refers to active nodes, but some segments still point at the removed nodes in the ideal state. The cluster still lists the removed nodes when the instances API is called. I've tried a few things but I can't get the tables back into a good state: I've rebalanced the table using different options, I've disabled the removed nodes in Pinot and rebalanced the tables again, and I've rebalanced the servers, but none of these have worked so far. I wonder if someone could let me know the steps to fix the issue
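For reference, the controller endpoints being discussed look roughly like this - a sketch, assuming a controller reachable at `localhost:9000`, a table named `myTable`, and a hypothetical instance name; adjust all of these for your cluster:

```shell
# List all instances registered in the cluster
# (the removed nodes still show up here)
curl -X GET "http://localhost:9000/instances"

# Attempt to drop a dead server; this returns 409 while the instance
# is still referenced in any table's ideal state
curl -X DELETE "http://localhost:9000/instances/Server_pinot-server-4_8098"

# Rebalance the table so segments are reassigned away from dead nodes
curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=OFFLINE&dryRun=false"
```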
Note this is happening to both realtime and offline tables. I thought that in a scenario like this another server would take over and the tables would be repaired immediately, but this doesn't seem to be the case
my controller is on Pinot 0.9.0, all the other nodes are on Pinot 0.8.0
Reading through your original issue. However, I'd definitely recommend not mixing and matching Pinot component versions
Are you able to untag/remove the bad instances?
I was doing the upgrade as per the docs so we left the controller running for a while when this happened
I can't remove the servers; the API returns 409 and says the server is used in ideal states
I haven't tried untagging, bear with me
I was able to update the tags
I've rebalanced the table again but nothing happens
Hmm if tags are updated then rebalance would move them
Actually, I see more segments in a bad state than before. I don't think it's related to what I've just done, though
Can you confirm the tag that you see on instances
Before untagging, all were using DefaultTenant_OFFLINE and DefaultTenant_REALTIME
I've replaced that with "removed" for all the dead servers
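The untagging step above maps to the controller's instance API - a sketch, assuming a controller at `localhost:9000`; the instance name is hypothetical:

```shell
# Replace the tenant tags on a dead server with a placeholder tag so the
# rebalancer no longer treats it as a valid assignment target
curl -X PUT "http://localhost:9000/instances/Server_pinot-server-4_8098/updateTags?tags=removed"
```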
Do you see the new tag on the instances though
Also did you try rebalance with downtime
Yes the servers are tagged as expected
I've just tried with downtime and it works, but I don't understand why
I still have 3 servers running
Downtime is supposed to be used when a single instance is running?
For offline tables that doesn't work
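For context, the downtime option that made the rebalance go through is just a query parameter on the rebalance endpoint - a sketch, assuming a controller at `localhost:9000` and a table named `myTable`:

```shell
# downtime=true lets the rebalancer move segments without keeping the
# configured number of replicas available to queries during the move,
# which can unblock a rebalance stuck waiting on dead replicas
curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=OFFLINE&downtime=true"
```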
I've managed to remove the servers by calling the endpoint as soon as I trigger a rebalance on the table
One table would refresh its ideal state correctly, but a little after it would refer to the dead nodes again
Some segments still report a bad status; I'll leave it like this for a while
I just noticed that someone above had a similar issue
This is raising some questions: if a node shuts down, shouldn't the segments it holds be served by another node transparently? Is that not how it's supposed to work?
Totally understand this is not a good experience. Let me DM you to help see what is going on
I've raised a github issue summarizing the problem at https://github.com/apache/pinot/issues/8281 please let me know if it needs more details - I'll try to include some logs. Thank you for the time you've spent looking at this problem
Thanks @Dan DC. To close the loop, we mitigated the issue by restarting controllers.