# troubleshooting
s
Hi team, I'm experiencing some issues in pinot `0.10.0` trying to rebalance an offline table after some offline-servers reached `dead` state (and have been replaced with new, healthy nodes). Was hoping to get some extra 👀 on it.
I started by dropping tags on the dead servers. I can see the ideal state of the table still contains the dead servers. I attempted to rebalance servers w/ reassign instances + no downtime, and can see the new proposed `targetAssignment` removes the dead nodes. The controller logs seem initially healthy before getting blocked in this wait and eventually timing out:
```
WARN [TableRebalancer] [jersey-server-managed-async-executor-6] Caught exception while waiting for ExternalView to converge for table: daily_user_metrics_by_channel_enterprise_bucketed_OFFLINE, aborting the rebalance
java.util.concurrent.TimeoutException: Timeout while waiting for ExternalView to converge
```
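(For reference, a rough sketch of the "drop tags, then inspect the ideal state" steps described above, against the controller REST API. The controller URL and server instance names are placeholders, and how tags are dropped may vary by Pinot version.)

```python
import requests

CONTROLLER = "http://pinot-controller:9000"  # placeholder controller address
TABLE = "daily_user_metrics_by_channel_enterprise_bucketed"
DEAD_SERVERS = ["Server_dead-node-0_8098", "Server_dead-node-1_8098"]  # placeholder instance names

# Drop the tenant tags on the dead servers so they are no longer eligible for assignment.
for instance in DEAD_SERVERS:
    resp = requests.put(
        f"{CONTROLLER}/instances/{instance}/updateTags",
        params={"tags": ""},  # assumption: an empty tag list drops the existing tags
    )
    resp.raise_for_status()

# The ideal state will still reference the dead servers until a rebalance
# succeeds and persists the new assignment.
ideal_state = requests.get(f"{CONTROLLER}/tables/{TABLE}/idealstate").json()
print(ideal_state)
```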
Any tips on further debugging why `ExternalView` is not converging on `IdealState`?
Also, at what point is the updated Ideal State persisted in Zookeeper? I still see Zookeeper holding the Ideal State that contains the dead nodes after running rebalance. Wondering if `ExternalView` is failing to converge on `IdealState` due to `IdealState` failing to persist its update and still referencing the dead nodes? 🤔
m
Has EV converged with IS now @Scott deRegt?
s
It has not. Due to the Timeout exception on the rebalance operation's `waitForExternalViewToConverge`, IS is not getting updated. Therefore, the current IS still includes the dead nodes, so it's impossible for EV to converge on it.
m
What’s the replication?
s
I have tested with 2 tables, one with replication of 2 and the other with 3.
m
And both have this issue?
s
Yes, that's correct. The logs above are for the table with rep factor of 3. I also confirmed for this table that the 2 dead offline servers in the cluster are in different logical instance assignment groups, meaning every segment of this table has at least 2 of its 3 replicas available.
For Rebalance Servers, I'm using Minimum Available Replicas = 1.
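(Roughly how that rebalance can be triggered against the controller REST API with reassign instances, no downtime, and min available replicas of 1. The controller URL is a placeholder, and parameter names may differ slightly between Pinot versions.)

```python
import requests

CONTROLLER = "http://pinot-controller:9000"  # placeholder controller address
TABLE = "daily_user_metrics_by_channel_enterprise_bucketed"

# Rebalance the OFFLINE table, keeping at least 1 replica of every segment
# available while segments are moved.
resp = requests.post(
    f"{CONTROLLER}/tables/{TABLE}/rebalance",
    params={
        "type": "OFFLINE",
        "dryRun": "false",            # run with "true" first to inspect the proposed assignment
        "reassignInstances": "true",
        "downtime": "false",
        "minAvailableReplicas": "1",
    },
)
resp.raise_for_status()
print(resp.json())  # rebalance status plus the target assignment
```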
m
The code you pointed to is not for writing the ideal state during rebalance; it is for detecting a change in the ideal state while the rebalance is happening. So the IS should have been updated with the dead nodes removed.
In the dry run do the dead nodes appear in the new IS returned?
Also, just want to ensure you are following this guide for removing dead nodes: https://docs.pinot.apache.org/operators/operating-pinot/rebalance/rebalance-servers
s
In dry run, the dead nodes do not appear in the new IS returned. Similarly, in the controller logs, the `targetAssignment` does not include the dead nodes in this log message when running w/ dry run = false.
m
What's the current IS and EV, and do they differ (if so, what's the difference)? Wondering if they differ even before the rebalance (for some reason) and are unable to converge, which causes the timeout even before the new rebalanced IS could be applied.
s
The only difference is the IS still contains the dead nodes and the EV does not. I do not see any segments in `ERROR` status.
Will DM you a sample diff between IS and EV.
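(A minimal sketch of pulling and diffing the IS and EV from the controller's view endpoints, assuming the responses are keyed by table type, then by segment, with server-to-state maps underneath. The controller URL is a placeholder.)

```python
import requests

CONTROLLER = "http://pinot-controller:9000"  # placeholder controller address
TABLE = "daily_user_metrics_by_channel_enterprise_bucketed"

ideal_state = requests.get(f"{CONTROLLER}/tables/{TABLE}/idealstate").json()
external_view = requests.get(f"{CONTROLLER}/tables/{TABLE}/externalview").json()

# Assumption: each response looks like {"OFFLINE": {segment: {server: state}}}.
is_map = ideal_state.get("OFFLINE", {})
ev_map = external_view.get("OFFLINE", {})

# Print only the segments whose server/state maps disagree between IS and EV.
for segment, is_servers in sorted(is_map.items()):
    ev_servers = ev_map.get(segment, {})
    if is_servers != ev_servers:
        print(f"{segment}:\n  IS: {is_servers}\n  EV: {ev_servers}")
```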
m
sounds good
In the meantime, if this is for testing, could you try with downtime? I am wondering if the code thinks there is no way to move forward without downtime.
s
Yes, could test with downtime in this case. We do want to have a good handle on how to recover from a lost server node (expecting this to be a fairly common occurrence) w/o downtime as we get ready to go live in a production env.
m
After offline sync, here's what's happening:
• In 0.10, the rebalance code first waits for the existing EV to converge with the existing (pre-rebalance) IS to drain any prior changes, before starting the rebalance.
• If a node goes down, IS and EV won't converge and the code times out.
In 0.11, this has been addressed as follows:
• When rebalancing, choose "downtime" as well as the newly added option "minimizeDataMovement".
• If the dead servers have been replaced with healthy ones and tagged appropriately before doing the above, this will do the rebalance without downtime (even though you had to choose it).
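(Under those 0.11+ recommendations, the rebalance call would look roughly like this; the controller URL is a placeholder, and the minimizeDataMovement parameter only exists on 0.11 and later.)

```python
import requests

CONTROLLER = "http://pinot-controller:9000"  # placeholder controller address
TABLE = "daily_user_metrics_by_channel_enterprise_bucketed"

# On Pinot 0.11+: rebalance with downtime=true and minimizeDataMovement=true.
# If the dead servers were already replaced with healthy, correctly tagged nodes,
# this completes without actual downtime despite the flag.
resp = requests.post(
    f"{CONTROLLER}/tables/{TABLE}/rebalance",
    params={
        "type": "OFFLINE",
        "dryRun": "false",
        "reassignInstances": "true",
        "downtime": "true",
        "minimizeDataMovement": "true",
    },
)
resp.raise_for_status()
print(resp.json())
```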
s
So in 0.10, there is no way to recover and re-achieve good table status after a lost server node w/ no downtime?
Does this also require bootstrap to be toggled so that we start with an empty instance assignment and ignore the current pre-rebalance IS?
m
There might be a way, let me dm.
s
Leaving this here for reference: the other best practice / proposed solution by @Kishore is to use consistent logical server names, such that if there is a hardware failure and a node is replaced, the new node gets the same logical server name. That way the existing Ideal State remains valid and the External View is able to converge on it.
l
Hi @Scott deRegt, did you find a way to do this? I'm facing the same issue now.
s
In our case, upgrading to Pinot `0.11.0`+ and doing a table rebalance with `downtime=true` and `minimizeDataMovement=true` worked (as mentioned here). Without more information on your cluster setup, I cannot confidently say whether or not that will solve your problem though.