Luis Fernandez
03/17/2022, 9:13 PM
Mayank
- What happens when some nodes go down for some time and come back up (either automatically or through human intervention)? In this scenario, you should just disable the node (via the REST API) and then re-enable it. Note that no rebalance is needed.
- What happens if some nodes need to be physically replaced (non-k8s), or new nodes are added, or old ones are removed? This is less of a chaos test and more of an operations test. It requires steps for correctly adding/removing instances, rebalancing, etc.
Luis Fernandez
03/17/2022, 9:24 PM
Luis Fernandez
03/17/2022, 9:24 PM
Luis Fernandez
03/17/2022, 9:47 PM
Elon
03/17/2022, 10:07 PM
Elon
03/17/2022, 10:07 PM
Luis Fernandez
03/18/2022, 12:27 AM
Luis Fernandez
03/18/2022, 12:28 AM
Luis Fernandez
03/18/2022, 12:28 AM
Luis Fernandez
03/18/2022, 12:28 AM
Mayank
Elon
03/18/2022, 1:43 AM
Luis Fernandez
03/18/2022, 2:35 PM
Luis Fernandez
03/18/2022, 2:39 PM
Failed to drop instance Server_10.12.64.100_8098 - Instance Server_10.12.64.100_8098 exists in ideal state for metrics_OFFLINE
Luis Fernandez
03/18/2022, 2:40 PM
{
  "id": "Server_10.12.64.100_8098",
  "simpleFields": {
    "HELIX_ENABLED": "true",
    "HELIX_ENABLED_TIMESTAMP": "1646853848603",
    "HELIX_HOST": "10.12.64.88",
    "HELIX_PORT": "8098",
    "adminPort": "8097",
    "shutdownInProgress": "true"
  },
  "mapFields": {
    "SYSTEM_RESOURCE_INFO": {
      "numCores": "4",
      "totalMemoryMB": "32768",
      "maxHeapSizeMB": "13312"
    }
  },
  "listFields": {
    "TAG_LIST": [
      "OldDefaultTenant_OFFLINE",
      "OldDefaultTenant_REALTIME"
    ]
  }
}
Luis Fernandez
03/18/2022, 2:40 PM
Luis Fernandez
03/18/2022, 3:17 PM
Luis Fernandez
03/21/2022, 1:09 PM
Mayank
Mayank
Luis Fernandez
03/21/2022, 2:28 PM
{
  "instanceName": "Server_10.12.64.88_8098",
  "hostName": "10.12.64.88",
  "enabled": true,
  "port": "8098",
  "tags": [
    "OldDefaultTenant_OFFLINE",
    "OldDefaultTenant_REALTIME"
  ],
  "pools": null,
  "grpcPort": -1,
  "adminPort": 8097,
  "systemResourceInfo": {
    "numCores": "4",
    "totalMemoryMB": "32768",
    "maxHeapSizeMB": "13312"
  }
}
Luis Fernandez
03/21/2022, 2:28 PM
Luis Fernandez
03/21/2022, 2:29 PM
Luis Fernandez
03/21/2022, 2:29 PM
Luis Fernandez
03/21/2022, 5:18 PM
Jackie
03/21/2022, 5:24 PM
Jackie
03/21/2022, 5:25 PM
Luis Fernandez
03/21/2022, 5:28 PM
Luis Fernandez
03/21/2022, 5:28 PM
Luis Fernandez
03/21/2022, 5:28 PM
Luis Fernandez
03/21/2022, 5:31 PM
Jackie
03/21/2022, 6:47 PM
Jackie
03/21/2022, 6:47 PM
dryRun
Luis Fernandez
03/21/2022, 6:48 PM
Luis Fernandez
03/21/2022, 6:52 PM
Luis Fernandez
03/21/2022, 6:58 PM
Luis Fernandez
03/21/2022, 6:58 PM
Luis Fernandez
03/21/2022, 6:59 PM
Luis Fernandez
03/21/2022, 6:59 PM
Luis Fernandez
03/21/2022, 6:59 PM
Luis Fernandez
03/21/2022, 7:00 PM
Luis Fernandez
03/21/2022, 7:00 PM
Jackie
03/21/2022, 7:03 PM
includeConsuming
Luis Fernandez
03/21/2022, 7:05 PM
Finished rebalancing table: metrics_OFFLINE with downtime in 63ms.
Luis Fernandez
03/21/2022, 7:09 PM
Caught exception while waiting for ExternalView to converge for table: metrics_REALTIME, aborting the rebalance
java.util.concurrent.TimeoutException: Timeout while waiting for ExternalView to converge
        at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.waitForExternalViewToConverge(TableRebalancer.java:523) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.rebalance(TableRebalancer.java:308) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.rebalanceTable(PinotHelixResourceManager.java:2793) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.api.resources.PinotTableRestletResource.lambda$rebalance$3(PinotTableRestletResource.java:601) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Luis Fernandez
03/21/2022, 7:11 PM
Luis Fernandez
03/21/2022, 7:12 PM
Luis Fernandez
03/21/2022, 7:12 PM
Luis Fernandez
03/21/2022, 7:14 PM
Tables
Luis Fernandez
03/21/2022, 7:26 PM
Luis Fernandez
03/21/2022, 7:30 PM
{
  "status": "Successfully dropped instance"
}
Luis Fernandez
03/21/2022, 7:31 PM
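
For reference, the sequence this thread went through (rebalance with `dryRun`, then with `includeConsuming`, then drop the instance once no ideal state references it) could be scripted roughly as below. This is only a sketch under assumptions: the controller address is hypothetical, the endpoint paths and the `type` and `downtime` query parameters reflect my understanding of the Pinot controller REST API and should be checked against your controller's Swagger page, and `drop_blockers` takes a simplified, hypothetical ideal-state mapping rather than the controller's actual response format. The functions only build URLs and inspect data; they do not send any requests.

```python
from urllib.parse import quote, urlencode

# Hypothetical controller address; replace with your own.
CONTROLLER = "http://localhost:9000"


def rebalance_url(table, table_type, dry_run=True,
                  include_consuming=False, downtime=False):
    """Build the URL for a table rebalance request (assumed to be sent with POST)."""
    params = {
        "type": table_type,                               # OFFLINE or REALTIME
        "dryRun": str(dry_run).lower(),                   # preview the new assignment only
        "includeConsuming": str(include_consuming).lower(),
        "downtime": str(downtime).lower(),
    }
    return f"{CONTROLLER}/tables/{quote(table)}/rebalance?{urlencode(params)}"


def drop_instance_url(instance):
    """Build the URL for dropping an instance (assumed to be sent with DELETE)."""
    return f"{CONTROLLER}/instances/{quote(instance)}"


def drop_blockers(instance, ideal_states):
    """Return the tables whose ideal state still references the instance.

    ideal_states is a simplified, hypothetical mapping:
    table name -> {segment name -> {instance name -> state}}.
    An empty result means the drop should no longer be rejected with
    "exists in ideal state for <table>".
    """
    return sorted(
        table
        for table, segments in ideal_states.items()
        if any(instance in replicas for replicas in segments.values())
    )
```

A cautious order of operations would be to call the rebalance URL with `dry_run=True` first, review the proposed assignment, repeat with `dry_run=False` for both the OFFLINE and REALTIME tables, and only request the drop once `drop_blockers` returns an empty list for the instance.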