Luis Fernandez
03/17/2022, 9:13 PM
Mayank
- What happens when some nodes go down for some time and come back up (either automatically or through human intervention)? In this scenario, you should just disable the node (via the REST API) and then re-enable it. Note that no rebalance is needed.
- What happens if some nodes need to be physically replaced (non-k8s), or new nodes are added, or old ones are removed? This is less of a chaos test and more of an operations test. It requires steps on how to correctly add/remove instances, rebalance, etc.
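The disable/re-enable step from the first bullet could be scripted roughly as below. This is a minimal sketch, not a confirmed procedure: the controller address `http://localhost:9000` is hypothetical, and the `/instances/{name}/state` path with a `state` query parameter is an assumption about the controller REST API that differs between Pinot versions (some older releases take the state in the request body instead), so check your controller's Swagger page first. The helpers only build the requests and send nothing.

```python
from urllib.parse import quote, urlencode

# Hypothetical controller address; replace with your own.
CONTROLLER = "http://localhost:9000"


def instance_state_request(instance: str, state: str, controller: str = CONTROLLER):
    """Return the (method, url) pair that would toggle an instance.

    Assumption: the controller exposes PUT /instances/{name}/state?state=...
    """
    if state not in ("ENABLE", "DISABLE"):
        raise ValueError(f"unsupported state: {state}")
    path = f"/instances/{quote(instance, safe='')}/state"
    return "PUT", f"{controller}{path}?{urlencode({'state': state})}"


def bounce_plan(instance: str):
    """Disable, then re-enable the node; no rebalance step in between."""
    return [instance_state_request(instance, s) for s in ("DISABLE", "ENABLE")]
```

Calling `bounce_plan("Server_10.12.64.88_8098")` yields the two requests in order, which can then be sent with whatever HTTP client you already use.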
Luis Fernandez
03/17/2022, 9:24 PM
Luis Fernandez
03/17/2022, 9:24 PM
Luis Fernandez
03/17/2022, 9:47 PM
Elon
03/17/2022, 10:07 PM
Elon
03/17/2022, 10:07 PM
Luis Fernandez
03/18/2022, 12:27 AM
Luis Fernandez
03/18/2022, 12:28 AM
Luis Fernandez
03/18/2022, 12:28 AM
Luis Fernandez
03/18/2022, 12:28 AM
Mayank
Elon
03/18/2022, 1:43 AM
Luis Fernandez
03/18/2022, 2:35 PM
Luis Fernandez
03/18/2022, 2:39 PM
Failed to drop instance Server_10.12.64.100_8098 - Instance Server_10.12.64.100_8098 exists in ideal state for metrics_OFFLINE
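The error above says the instance is still referenced by the ideal state of `metrics_OFFLINE`. A small sketch of how one might list the segments that still point at an instance before retrying the drop. It assumes the ideal state has already been fetched and reduced to a mapping of segment name to `{instance: state}`; that shape, the segment names, and the sample data are illustrative assumptions, not output from this cluster.

```python
def segments_on_instance(ideal_state: dict, instance: str) -> list:
    """List segments whose ideal-state assignment still names `instance`.

    Assumed input shape: {segment_name: {instance_name: state}}.
    """
    return sorted(
        segment
        for segment, assignment in ideal_state.items()
        if instance in assignment
    )


# Made-up ideal state, for illustration only.
example = {
    "metrics_0": {
        "Server_10.12.64.100_8098": "ONLINE",
        "Server_10.12.64.88_8098": "ONLINE",
    },
    "metrics_1": {"Server_10.12.64.88_8098": "ONLINE"},
}

# Segments that would block dropping the instance in this made-up example.
blocking = segments_on_instance(example, "Server_10.12.64.100_8098")
```

An empty result would suggest the instance is no longer referenced by that table's ideal state.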
Luis Fernandez
03/18/2022, 2:40 PM
{
  "id": "Server_10.12.64.100_8098",
  "simpleFields": {
    "HELIX_ENABLED": "true",
    "HELIX_ENABLED_TIMESTAMP": "1646853848603",
    "HELIX_HOST": "10.12.64.88",
    "HELIX_PORT": "8098",
    "adminPort": "8097",
    "shutdownInProgress": "true"
  },
  "mapFields": {
    "SYSTEM_RESOURCE_INFO": {
      "numCores": "4",
      "totalMemoryMB": "32768",
      "maxHeapSizeMB": "13312"
    }
  },
  "listFields": {
    "TAG_LIST": [
      "OldDefaultTenant_OFFLINE",
      "OldDefaultTenant_REALTIME"
    ]
  }
}
Luis Fernandez
03/18/2022, 2:40 PM
Luis Fernandez
03/18/2022, 3:17 PM
Luis Fernandez
03/21/2022, 1:09 PM
Mayank
Mayank
Luis Fernandez
03/21/2022, 2:28 PM
{
  "instanceName": "Server_10.12.64.88_8098",
  "hostName": "10.12.64.88",
  "enabled": true,
  "port": "8098",
  "tags": [
    "OldDefaultTenant_OFFLINE",
    "OldDefaultTenant_REALTIME"
  ],
  "pools": null,
  "grpcPort": -1,
  "adminPort": 8097,
  "systemResourceInfo": {
    "numCores": "4",
    "totalMemoryMB": "32768",
    "maxHeapSizeMB": "13312"
  }
}
Luis Fernandez
03/21/2022, 2:28 PM
Luis Fernandez
03/21/2022, 2:29 PM
Luis Fernandez
03/21/2022, 2:29 PM
Luis Fernandez
03/21/2022, 5:18 PM
Jackie
03/21/2022, 5:24 PM
Jackie
03/21/2022, 5:25 PM
Luis Fernandez
03/21/2022, 5:28 PM
Luis Fernandez
03/21/2022, 5:28 PM
Luis Fernandez
03/21/2022, 5:28 PM
Luis Fernandez
03/21/2022, 5:31 PM
Jackie
03/21/2022, 6:47 PM
Jackie
03/21/2022, 6:47 PM
dryRun mode first and see if the assignment is expected
Luis Fernandez
03/21/2022, 6:48 PM
Luis Fernandez
03/21/2022, 6:52 PM
Luis Fernandez
03/21/2022, 6:58 PM
Luis Fernandez
03/21/2022, 6:58 PM
Luis Fernandez
03/21/2022, 6:59 PM
Luis Fernandez
03/21/2022, 6:59 PM
Luis Fernandez
03/21/2022, 6:59 PM
Luis Fernandez
03/21/2022, 7:00 PM
Luis Fernandez
03/21/2022, 7:00 PM
Jackie
03/21/2022, 7:03 PM
includeConsuming to true
Luis Fernandez
03/21/2022, 7:05 PM
Finished rebalancing table: metrics_OFFLINE with downtime in 63ms.
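The two rebalance options mentioned in the thread, `dryRun` and `includeConsuming`, could be combined into a request roughly as follows. This is a sketch under assumptions: the controller address is hypothetical, and the `/tables/{table}/rebalance` path and the `type` parameter reflect my understanding of the controller REST API rather than anything stated in this conversation. The function only builds the URL; it does not call the controller.

```python
from urllib.parse import urlencode


def rebalance_url(controller: str, table: str, table_type: str,
                  dry_run: bool = True, include_consuming: bool = False) -> str:
    """Build the URL for a table rebalance request.

    dry_run defaults to True so the proposed assignment can be inspected
    before anything is moved.
    """
    if table_type not in ("OFFLINE", "REALTIME"):
        raise ValueError(f"unknown table type: {table_type}")
    params = {
        "type": table_type,
        "dryRun": str(dry_run).lower(),
        "includeConsuming": str(include_consuming).lower(),
    }
    return f"{controller}/tables/{table}/rebalance?{urlencode(params)}"


# Hypothetical controller address; a dry run for the realtime table.
preview = rebalance_url("http://localhost:9000", "metrics", "REALTIME",
                        dry_run=True, include_consuming=True)
```

Once the dry-run assignment looks right, the same call with `dry_run=False` gives the URL for the real rebalance.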
Luis Fernandez
03/21/2022, 7:09 PM
Caught exception while waiting for ExternalView to converge for table: metrics_REALTIME, aborting the rebalance
java.util.concurrent.TimeoutException: Timeout while waiting for ExternalView to converge
    at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.waitForExternalViewToConverge(TableRebalancer.java:523) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
    at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.rebalance(TableRebalancer.java:308) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
    at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.rebalanceTable(PinotHelixResourceManager.java:2793) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
    at org.apache.pinot.controller.api.resources.PinotTableRestletResource.lambda$rebalance$3(PinotTableRestletResource.java:601) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
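The timeout above is about the ExternalView not catching up with the target assignment. As a rough mental model only (a simplification, not the logic in Pinot's `TableRebalancer`), one can treat a segment as converged once every non-OFFLINE replica in the ideal state shows the same state in the external view. The sketch below assumes both views are plain mappings of segment name to `{instance: state}`; the segment and instance names in the usage note are made up.

```python
def unconverged_segments(ideal_state: dict, external_view: dict) -> list:
    """Return segments whose external view does not yet match the ideal state.

    Simplified model: every replica that the ideal state wants in a
    non-OFFLINE state must be reported in that same state.
    """
    pending = []
    for segment, wanted in ideal_state.items():
        actual = external_view.get(segment, {})
        for instance, state in wanted.items():
            if state != "OFFLINE" and actual.get(instance) != state:
                pending.append(segment)
                break  # one lagging replica is enough to flag the segment
    return sorted(pending)
```

For example, with an ideal state of `{"seg_1": {"s1": "CONSUMING", "s2": "CONSUMING"}}` and an external view that only lists `s1`, the function reports `seg_1` as still pending, which is the kind of gap that can keep a rebalance waiting until it times out.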
Luis Fernandez
03/21/2022, 7:11 PM
Luis Fernandez
03/21/2022, 7:12 PM
Luis Fernandez
03/21/2022, 7:12 PM
Luis Fernandez
03/21/2022, 7:14 PM
Tables UI again
Luis Fernandez
03/21/2022, 7:26 PM
Luis Fernandez
03/21/2022, 7:30 PM
{
  "status": "Successfully dropped instance"
}
Luis Fernandez
03/21/2022, 7:31 PM