# troubleshooting
hello, I've run into an issue where someone scaled up our Pinot servers, left the new nodes in the cluster for a while, and then simply removed the nodes again - the API to remove the instances was never invoked, and neither was table rebalancing. Some segments were assigned to the new nodes, and after those nodes were removed the tables went into a bad state. The current state is that queries are not reading the segments in bad state. The table external view only refers to active nodes, but some segments still point at the removed nodes in the ideal state. The cluster still lists the removed nodes when the instances API is called. I've tried a few things but I can't get the tables back into a good state: I've rebalanced the table using different options, I've disabled the removed nodes in Pinot and rebalanced the tables again, and I've rebalanced the servers, but none of these have worked so far. I wonder if someone could let me know the steps to fix the issue
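For reference, the controller endpoints being discussed look roughly like this - a sketch, assuming a controller reachable at `localhost:9000`, a table named `myTable`, and a hypothetical instance name; adjust all of these for your cluster:

```shell
# List all instances registered in the cluster
# (the removed nodes still show up here)
curl -X GET "http://localhost:9000/instances"

# Attempt to drop a dead server; this returns 409 while the instance
# is still referenced in any table's ideal state
curl -X DELETE "http://localhost:9000/instances/Server_pinot-server-4_8098"

# Rebalance the table so segments are reassigned away from dead nodes
curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=OFFLINE&dryRun=false"
```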
Note this is happening to both realtime and offline tables. I thought that in a scenario like this another server would take over and the tables would be repaired immediately, but this doesn't seem to be the case
my controller is on Pinot 0.9.0, all the other nodes are on Pinot 0.8.0
Reading through your original issue. However, I'd definitely recommend not mixing and matching Pinot component versions
Are you able to untag/remove the bad instances?
I was doing the upgrade as per the docs so we left the controller running for a while when this happened
I can't remove the servers; the API returns 409 and says the server is used in ideal states
I haven't tried untagging, bear with me
I was able to update the tags
I've rebalanced the table again but nothing happens
Hmm if tags are updated then rebalance would move them
Actually, I see more segments in a bad state than before. I don't think it's related to what I've just done, though
Can you confirm the tag that you see on instances
Before untagging, all were using DefaultTenant_OFFLINE and DefaultTenant_REALTIME
I've replaced that with "removed" for all the dead servers
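The untagging step above maps to the controller's instance API - a sketch, assuming a controller at `localhost:9000`; the instance name is hypothetical:

```shell
# Replace the tenant tags on a dead server with a placeholder tag so the
# rebalancer no longer treats it as a valid assignment target
curl -X PUT "http://localhost:9000/instances/Server_pinot-server-4_8098/updateTags?tags=removed"
```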
Do you see the new tag on the instances though
Also did you try rebalance with downtime
Yes the servers are tagged as expected
I've just tried with downtime and it works, but I don't understand why
I still have 3 servers running
Downtime is supposed to be used when a single instance is running?
For offline tables that doesn't work
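For context, the downtime option that made the rebalance go through is just a query parameter on the rebalance endpoint - a sketch, assuming a controller at `localhost:9000` and a table named `myTable`:

```shell
# downtime=true lets the rebalancer move segments without keeping the
# configured number of replicas available to queries during the move,
# which can unblock a rebalance stuck waiting on dead replicas
curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=OFFLINE&downtime=true"
```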
I've managed to remove the servers by calling the endpoint as soon as I trigger a rebalance on the table
One table would refresh its ideal state correctly, but a little after it would refer to the dead nodes again
Some segments still report a bad status; I'll leave it like this for a while
I just noticed that someone above had a similar issue
This is raising some questions: if a node shuts down, shouldn't the segments it holds be served by another node transparently? Is that not how it's supposed to work?
Totally understand this is not a good experience. Let me DM you to help see what is going on
I've raised a github issue summarizing the problem at https://github.com/apache/pinot/issues/8281 please let me know if it needs more details - I'll try to include some logs. Thank you for the time you've spent looking at this problem
Thanks @Dan DC. To close the loop, we mitigated the issue by restarting controllers.