Scott deRegt
10/04/2022, 3:45 PM
0.10.0 trying to rebalance an offline table after some offline-servers reached dead state (and have been replaced with new, healthy nodes), was hoping to get some extra 👀 on it.
Scott deRegt
10/04/2022, 3:46 PM
tags on the dead servers. I can see the ideal state of the table still contains the dead servers. I attempt to rebalance servers w/ reassign instances + no downtime - see the new proposed targetAssignment removes the dead nodes.
Controller logs seem initially healthy before being blocked in this wait and eventually timing out:
WARN [TableRebalancer] [jersey-server-managed-async-executor-6] Caught exception while waiting for ExternalView to converge for table: daily_user_metrics_by_channel_enterprise_bucketed_OFFLINE, aborting the rebalance
java.util.concurrent.TimeoutException: Timeout while waiting for ExternalView to converge
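One way to see why the wait above never finishes is to diff the two maps directly. The following is a minimal sketch, not Pinot's actual convergence check: it assumes the IdealState and ExternalView have already been fetched (for example from the controller's `/tables/{tableName}/idealstate` and `/tables/{tableName}/externalview` endpoints) and reduced to a plain `segment -> {instance -> state}` mapping. The function name and the sample server names are hypothetical.

```python
from typing import Dict, List, Tuple

# segment name -> {server instance -> segment state}
SegmentMap = Dict[str, Dict[str, str]]


def diff_ideal_vs_external(
    ideal: SegmentMap, external: SegmentMap
) -> List[Tuple[str, str, str, str]]:
    """Return (segment, instance, ideal_state, external_state) for each mismatch.

    Simplified assumption: the ExternalView has converged once every
    (segment, instance) pair in the IdealState shows the same state in the
    ExternalView. A pair absent from the ExternalView is reported as "MISSING".
    """
    mismatches = []
    for segment, ideal_instances in sorted(ideal.items()):
        external_instances = external.get(segment, {})
        for instance, ideal_state in sorted(ideal_instances.items()):
            external_state = external_instances.get(instance, "MISSING")
            if external_state != ideal_state:
                mismatches.append((segment, instance, ideal_state, external_state))
    return mismatches


# Hypothetical data: the IdealState still references a dead server.
ideal_state = {"seg_0": {"Server_a": "ONLINE", "Server_dead": "ONLINE"}}
external_view = {"seg_0": {"Server_a": "ONLINE"}}

for row in diff_ideal_vs_external(ideal_state, external_view):
    print(row)
```

If every reported instance is a dead server, the ExternalView cannot converge on that IdealState no matter how long the rebalance waits.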
Scott deRegt
10/04/2022, 3:48 PM
ExternalView is not converging on IdealState?
Scott deRegt
10/04/2022, 3:55 PM
Ideal State persisted in ZooKeeper? I still see ZooKeeper holding the Ideal State that contains the dead nodes after running rebalance.
Scott deRegt
10/04/2022, 3:58 PM
ExternalView is failing to converge on IdealState due to IdealState failing to persist its update and still referencing dead nodes? 🤔
Scott deRegt
10/04/2022, 5:03 PM
Mayank
Scott deRegt
10/04/2022, 5:53 PM
rebalance operation's waitForExternalViewToConverge, IS is not getting updated. Therefore, the current IS still includes the dead nodes, so it is impossible for EV to converge on it.
Mayank
Scott deRegt
10/04/2022, 5:55 PM
replication of 2 and the other with 3.
Mayank
Scott deRegt
10/04/2022, 5:58 PM
3. I also confirmed for this table that the 2 dead offline servers in the cluster are in different logical instance assignment groups, meaning every segment of this table has at least 2 of its 3 replicas available.
Scott deRegt
10/04/2022, 6:00 PM
Rebalance Servers, using a Minimum Available Replicas = 1
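The options discussed in this thread map onto query parameters of the controller's table rebalance endpoint. Below is a rough sketch of two helpers: one checks how many replicas per segment remain on live servers (to compare against the minimum available replicas setting), and one builds the request URL for a dry run. The helper names, the controller address, the table name, and the server names are all hypothetical. The query parameter names are the ones I believe the controller accepts, so treat them as an assumption and check them against your version's Swagger page before sending anything.

```python
from typing import Dict, Set
from urllib.parse import quote, urlencode


def min_live_replicas(ideal: Dict[str, Dict[str, str]], dead: Set[str]) -> int:
    """Smallest number of replicas hosted on live servers across all segments."""
    return min(
        (sum(1 for instance in instances if instance not in dead)
         for instances in ideal.values()),
        default=0,
    )


def build_rebalance_url(
    controller: str,
    table: str,
    *,
    table_type: str = "OFFLINE",
    dry_run: bool = True,
    reassign_instances: bool = True,
    downtime: bool = False,
    min_available_replicas: int = 1,
    bootstrap: bool = False,
) -> str:
    """Build the URL for a POST to the rebalance endpoint (parameter names assumed)."""
    params = {
        "type": table_type,
        "dryRun": str(dry_run).lower(),
        "reassignInstances": str(reassign_instances).lower(),
        "downtime": str(downtime).lower(),
        "minAvailableReplicas": min_available_replicas,
        "bootstrap": str(bootstrap).lower(),
    }
    base = controller.rstrip("/")
    return f"{base}/tables/{quote(table, safe='')}/rebalance?{urlencode(params)}"


# Hypothetical layout: replication of 3, one dead server in each of two groups.
ideal_state = {
    "seg_0": {"S1": "ONLINE", "S2": "ONLINE", "S3": "ONLINE"},
    "seg_1": {"S4": "ONLINE", "S5": "ONLINE", "S6": "ONLINE"},
}
live = min_live_replicas(ideal_state, dead={"S1", "S5"})
url = build_rebalance_url("http://localhost:9000", "myTable")
print(live)
print(url)
```

Starting with `dry_run=True` returns the proposed assignment without changing anything, which matches the dry-run comparison described later in the thread.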
Mayank
Mayank
Mayank
Scott deRegt
10/04/2022, 6:12 PM
dead nodes do not appear in the new IS returned.
Similarly, in controller logs, the targetAssignment does not include the dead nodes in this log message when running w/ dry run = false.
Mayank
Scott deRegt
10/04/2022, 6:21 PM
dead nodes and the EV does not. I do not see any segments in ERROR status.
Scott deRegt
10/04/2022, 6:23 PM
Mayank
Mayank
Scott deRegt
10/04/2022, 6:31 PM
Mayank
Scott deRegt
10/04/2022, 6:32 PM
0.10, there is no way to recover and re-achieve good table status after a lost server node w/ no downtime?
Scott deRegt
10/04/2022, 6:36 PM
bootstrap to be toggled so that we start with an empty instance assignment and ignore current pre-rebalance IS?
Mayank
Scott deRegt
10/04/2022, 8:31 PM
Lee Wei Hern Jason
01/09/2024, 11:09 AM
Scott deRegt
01/09/2024, 4:39 PM
downtime=true and minimizeDataMovement=true worked for us (as mentioned here). Without more information on your cluster setup, I cannot confidently say whether or not that will solve your problem though.