# troubleshooting
l
hey friends, I want to run your thoughts through something. I have been doing some chaos exercises in Pinot to see how it reacts. This is my current scenario:

Chaos exercise in Pinot:

System config: 1 minion, 2 servers, 2 brokers, 3 controllers, 3 zookeepers, data replication 2, backup GCS, environment GKE

Scenario: downsize to 1 server, remove the server PVC, see the impact, try to go back to normal (2 servers)

Steps:
1. Downsize servers to 1 with kubectl scale
2. Remove the PVC of server 1 with kubectl delete pvc
3. Observation: p99 response time in the system still strong, no noticeable changes
4. Upsize back to 2 with kubectl scale
5. Observation: things don't kick in automatically. It seems like there are some manual steps I have to do: I don't see the new server consuming or having data pulled from GCS, and I still see the old server in the servers UI in the pinot-controller. It seems like I need to run a rebalance at this point
6. Update offline and online tags of the old server with the endpoint in the pinot-controller
7. Seems like we can issue a rebalance now
8. Issue it with the following: dryRun=false, reassignInstances=true, includeConsuming=false, bootstrap=true, downtime=false, minAvailableReplicas=true, bestEfforts=false
9. Observation: not seeing noticeable changes in p99 response time

At this point the second instance is still not in a great state and not consuming; however, the system is okay, still performing at ms for p99s. I'm wondering the following:

Questions:
• What to look for when a rebalance is done in the pinot-controller logs?
• When to delete the old server tag? Do I need to also issue an updateBrokerResource? I tried to delete but it says that Instance Server_10.12.64.88_8098 exists in ideal state for table and it doesn't let me drop; at this point I cannot see the tables in the UI
• Any other thing I should have done while rebalancing?
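For reference, this is roughly what I ran. The statefulset/PVC names, the controller host/port, and the table name are placeholders, and the rebalance params are the ones from step 8 above, so treat it as a sketch rather than the exact commands:

```bash
# 1-2. Downsize servers to 1 and delete the PVC of the removed server
#      (statefulset and PVC names are placeholders)
kubectl scale statefulset pinot-server --replicas=1
kubectl delete pvc data-pinot-server-1

# 4. Scale back up to 2 servers
kubectl scale statefulset pinot-server --replicas=2

# 8. Rebalance via the controller REST API, params as listed in step 8
#    (controller host/port and table name are placeholders;
#    minAvailableReplicas takes an integer, I listed it as "true" above)
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=OFFLINE\
&dryRun=false&reassignInstances=true&includeConsuming=false&bootstrap=true\
&downtime=false&minAvailableReplicas=1&bestEfforts=false"
```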
m
I think there are two separate things to evaluate:
- What happens when some nodes go down for some time and come back up (either automatically or through human intervention). In this scenario, you should just disable the node via the REST API and then re-enable it (see the sketch after this list). Note: no rebalance needed.

- What happens if some nodes need to be physically replaced (non-k8s), or new nodes need to be added, or old ones removed. This is less of a chaos test and more of an operations test. This one requires steps on how to correctly add/remove instances, rebalance, etc.
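For the first case, something like this; the instance state endpoint and the controller host/port here are from memory, so double-check against your controller's Swagger UI:

```bash
# Disable the instance while the node is down, re-enable it once it is back.
# Endpoint and controller host/port are from memory (verify via Swagger UI);
# no rebalance is needed for this case.
curl -X POST "http://pinot-controller:9000/instances/Server_10.12.64.88_8098/state" \
     -H "Content-Type: text/plain" -d "disable"

# ... node comes back up ...

curl -X POST "http://pinot-controller:9000/instances/Server_10.12.64.88_8098/state" \
     -H "Content-Type: text/plain" -d "enable"
```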
l
oh I guess I'm doing more of an operations exercise when I'm trying to go back to 2 nodes? usually what I have been doing is scale down and up and things usually work
though on the server side I went the extra mile of getting rid of the PVC
basically the thing that’s happening right now is that for some reason that old server is still around and the new one feels like it’s not in the rotation
e
Did you try to remove the instance and it failed? You would have to rebalance the tables on it, then restart it (in the worst case), then delete the instance. Is that one of the issues you have?
Sorry - also have to untag the server instance first
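Roughly this sequence; the controller host/port, table name, and the tag you park the instance on are placeholders, so treat it as a sketch:

```bash
# 1. Untag the server (move it onto a tag no table uses) - tag names are placeholders
curl -X PUT "http://pinot-controller:9000/instances/Server_10.12.64.88_8098/updateTags?tags=OldDefaultTenant_OFFLINE,OldDefaultTenant_REALTIME"

# 2. Rebalance the tables so their segments move off that server
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=OFFLINE&dryRun=false&reassignInstances=true"

# 3. Once the instance no longer appears in any ideal state, drop it
curl -X DELETE "http://pinot-controller:9000/instances/Server_10.12.64.88_8098"
```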
l
like on that particular instance? but that instance is now gone
there are 2 new ones
yep and I did the untagging
i gave it another tag basically using the update tag endpoint
e
One point about untagging: you have to update the instance config if you are removing tags. IIRC the updateTags endpoint only takes non-null tags.
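i.e. overwrite the instance config with the tags removed; the exact payload shape and the controller host/port here are from memory, so check it against a GET first:

```bash
# Fetch the current instance config, then PUT it back with an empty tag list.
# Field names are from memory - verify the shape against the GET response first.
curl "http://pinot-controller:9000/instances/Server_10.12.64.88_8098"

curl -X PUT "http://pinot-controller:9000/instances/Server_10.12.64.88_8098" \
     -H "Content-Type: application/json" \
     -d '{
           "instanceName": "Server_10.12.64.88_8098",
           "host": "10.12.64.88",
           "port": "8098",
           "type": "SERVER",
           "tags": []
         }'
```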
l
hey yea I was reading that, this takes me to a weird place tho, but this is basically what i have been doing
and @Elon that's what I did, and I see the old server with the old tags, but I still cannot delete it because of the error mentioned above
```
Failed to drop instance Server_10.12.64.100_8098 - Instance Server_10.12.64.100_8098 exists in ideal state for metrics_OFFLINE
```
```json
{
  "id": "Server_10.12.64.100_8098",
  "simpleFields": {
    "HELIX_ENABLED": "true",
    "HELIX_ENABLED_TIMESTAMP": "1646853848603",
    "HELIX_HOST": "10.12.64.88",
    "HELIX_PORT": "8098",
    "adminPort": "8097",
    "shutdownInProgress": "true"
  },
  "mapFields": {
    "SYSTEM_RESOURCE_INFO": {
      "numCores": "4",
      "totalMemoryMB": "32768",
      "maxHeapSizeMB": "13312"
    }
  },
  "listFields": {
    "TAG_LIST": [
      "OldDefaultTenant_OFFLINE",
      "OldDefaultTenant_REALTIME"
    ]
  }
}
```
this is what i see in configs in the zookeeper explorer
right now the EXTERNALVIEW definitely differs from the IDEALSTATES for the tables
looking for some help here: the dead node still can't be deleted and the UI is still blocked on the realtime table. I also have questions about the new server and the rebalance operation, as I don't see the data I would expect there
m
If the server exists in the ideal state, then that means it wasn't untagged and rebalanced correctly.
Can you check if the instance still has the tag?
l
```json
{
  "instanceName": "Server_10.12.64.88_8098",
  "hostName": "10.12.64.88",
  "enabled": true,
  "port": "8098",
  "tags": [
    "OldDefaultTenant_OFFLINE",
    "OldDefaultTenant_REALTIME"
  ],
  "pools": null,
  "grpcPort": -1,
  "adminPort": 8097,
  "systemResourceInfo": {
    "numCores": "4",
    "totalMemoryMB": "32768",
    "maxHeapSizeMB": "13312"
  }
}
```
the tag is not the same as the default one, I updated it
so it’s OldDefaultTenant now
@Jackie
j
Can you try rebalancing with downtime on? Some segments might be in ERROR or OFFLINE state on the server to be removed
You can verify that by comparing the IDEAL_STATE and EXTERNAL_VIEW of this table
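You can also pull both from the controller REST API and diff them; the controller host/port and table name here are placeholders:

```bash
# Compare IDEAL_STATE vs EXTERNAL_VIEW for the table.
# Segments whose state differs (e.g. ONLINE in one, ERROR/OFFLINE in the other)
# are what keep a no-downtime rebalance from converging.
curl -s "http://pinot-controller:9000/tables/metrics/idealstate"   > idealstate.json
curl -s "http://pinot-controller:9000/tables/metrics/externalview" > externalview.json
diff <(python -m json.tool idealstate.json) <(python -m json.tool externalview.json)
```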
l
in the zookeeper browser, EXTERNALVIEW and IDEALSTATES yes?
everything looks okay, no ERROR or OFFLINE segments
(also this is a hybrid table so i checked both OFFLINE and REALTIME)
do these options make sense to you? dryRun=false, reassignInstances=true, includeConsuming=false, bootstrap=true, downtime=true, minAvailableReplicas=true, bestEfforts=false
j
Yes, it should work
To be safe, you may try dryRun mode first and see if the assignment is expected
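Something like this, just flipping dryRun on the same endpoint (controller host/port and table name are placeholders):

```bash
# Dry run: returns the proposed segment assignment without moving anything
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=OFFLINE\
&dryRun=true&reassignInstances=true&bootstrap=true&downtime=true"

# If the proposed assignment looks right, run it for real
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=OFFLINE\
&dryRun=false&reassignInstances=true&bootstrap=true&downtime=true"
```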
l
yea assignment looks good to me, let me try to run this
and will do in both REALTIME and OFFLINE
this seems really promising so far
[screenshot: segment and document counts rising on the new server]
i can see segment count and documents count going up in the newer server
spike in p99 but nothing bad
[screenshot: p99 response times]
y axis is ms
(those are response times)
j
For the realtime table, in order to immediately remove all the segments from the old server, you should also set includeConsuming to true
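e.g. (controller host/port and table name are placeholders):

```bash
# Realtime side: includeConsuming=true so consuming segments are also moved
# off the old server immediately
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=REALTIME\
&dryRun=false&reassignInstances=true&includeConsuming=true&bootstrap=true&downtime=true"
```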
l
```
Finished rebalancing table: metrics_OFFLINE with downtime in 63ms.
```
I did find this exception lurking; I think it's from before this, but do you have any clue why it may pop up?
```
Caught exception while waiting for ExternalView to converge for table: metrics_REALTIME, aborting the rebalance
java.util.concurrent.TimeoutException: Timeout while waiting for ExternalView to converge
        at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.waitForExternalViewToConverge(TableRebalancer.java:523) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.rebalance(TableRebalancer.java:308) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.rebalanceTable(PinotHelixResourceManager.java:2793) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.api.resources.PinotTableRestletResource.lambda$rebalance$3(PinotTableRestletResource.java:601) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
```
omg this is great
[screenshot: segment and document counts matching on both servers]
same segments same documents :’)
i can finally access the Tables UI again
ok now i’m gonna get rid of the old instance with the delete instance endpoint
i’m tearing up a little
```json
{
  "status": "Successfully dropped instance"
}
```
I guess I have 2 questions: why couldn't the rebalance work with no downtime, and what's that weird exception?