# troubleshooting
l
hey friends, I want to run your thoughts through something. I have been doing some chaos exercises in Pinot to see how it reacts. This is my current scenario:

Chaos exercise in Pinot:

System config: 1 minion, 2 servers, 2 brokers, 3 controllers, 3 zookeepers, data replication 2, backup GCS, environment GKE

Scenario: downsize to 1 server, remove the server PVC, see the impact, try to go back to normal (2 servers)

Steps:
1. Downsize servers to 1 with kubectl scale
2. Remove the PVC of server 1 with kubectl delete pvc
3. Observation: p99 response time in the system still strong, no noticeable changes
4. Upsize back to 2 with kubectl scale
5. Observation: things don't kick in automatically. It seems like there are some manual steps I have to do: I don't see the new server consuming or having data pulled from GCS, and I still see the old server in the servers UI in the pinot-controller. It seems like I need to run a rebalance at this point
6. Update offline and online tags of the old server with the endpoint in the pinot-controller
7. Seems like we can issue a rebalance now
8. Issue it with the following: dryRun=false, reassignInstances=true, includeConsuming=false, bootstrap=true, downtime=false, minAvailableReplicas=true, bestEfforts=false
9. Observation: not seeing noticeable changes in p99 response time

At this point the second instance is still not in a great state and not consuming; however, the system is okay, still performing at ms for p99s. I'm wondering the following:

Questions:
• What to look for when a rebalance is done in the pinot-controller logs?
• When to delete the old server tag? Do I need to also issue an updateBrokerResource? I tried to delete but it says that Instance Server_10.12.64.88_8098 exists in ideal state for table and it doesn't let me drop; at this point I cannot see the tables in the UI
• Any other thing I should have done while rebalancing?
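For reference, this is roughly what I ran. The statefulset/PVC names, the controller host/port, and the table name are placeholders, and the rebalance params are the ones from step 8 above, so treat it as a sketch rather than the exact commands:

```bash
# 1-2. Downsize servers to 1 and delete the PVC of the removed server
#      (statefulset and PVC names are placeholders)
kubectl scale statefulset pinot-server --replicas=1
kubectl delete pvc data-pinot-server-1

# 4. Scale back up to 2 servers
kubectl scale statefulset pinot-server --replicas=2

# 8. Rebalance via the controller REST API, params as listed in step 8
#    (controller host/port and table name are placeholders;
#    minAvailableReplicas takes an integer, I listed it as "true" above)
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=OFFLINE\
&dryRun=false&reassignInstances=true&includeConsuming=false&bootstrap=true\
&downtime=false&minAvailableReplicas=1&bestEfforts=false"
```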
m
I think there are two separate things to evaluate:
- What happens when some nodes go down for some time and come back up (either automatically or through human intervention). In this scenario, you should just disable the node via the REST API and then re-enable it (see the sketch after this list). Note: no rebalance needed.

- What happens if some nodes need to be physically replaced (non-k8s), or new nodes need to be added, or old ones removed. This is less of a chaos test and more of an operations test. This one requires steps on how to correctly add/remove instances, rebalance, etc.
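For the first case, something like this; the instance state endpoint and the controller host/port here are from memory, so double-check against your controller's Swagger UI:

```bash
# Disable the instance while the node is down, re-enable it once it is back.
# Endpoint and controller host/port are from memory (verify via Swagger UI);
# no rebalance is needed for this case.
curl -X POST "http://pinot-controller:9000/instances/Server_10.12.64.88_8098/state" \
     -H "Content-Type: text/plain" -d "disable"

# ... node comes back up ...

curl -X POST "http://pinot-controller:9000/instances/Server_10.12.64.88_8098/state" \
     -H "Content-Type: text/plain" -d "enable"
```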
l
oh I guess I'm doing more of an operations exercise when I'm trying to go back to 2 nodes? usually what I have been doing is scale down and up and things usually work
though on the server side I went the extra mile of getting rid of the PVC
basically the thing that’s happening right now is that for some reason that old server is still around and the new one feels like it’s not in the rotation
e
Did you try to remove the instance and it failed? You would have to rebalance the tables on it, then restart it (in the worst case), then delete the instance. Is that one of the issues you have?
Sorry - also have to untag the server instance first
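Roughly this sequence; the controller host/port, table name, and the tag you park the instance on are placeholders, so treat it as a sketch:

```bash
# 1. Untag the server (move it onto a tag no table uses) - tag names are placeholders
curl -X PUT "http://pinot-controller:9000/instances/Server_10.12.64.88_8098/updateTags?tags=OldDefaultTenant_OFFLINE,OldDefaultTenant_REALTIME"

# 2. Rebalance the tables so their segments move off that server
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=OFFLINE&dryRun=false&reassignInstances=true"

# 3. Once the instance no longer appears in any ideal state, drop it
curl -X DELETE "http://pinot-controller:9000/instances/Server_10.12.64.88_8098"
```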
l
like on that particular instance? but that instance is now gone
there are 2 new ones
yep and I did the untagging
i gave it another tag basically using the update tag endpoint
e
One point about untagging: you have to update the instance config if you are removing tags. IIRC the updateTags endpoint only takes non-null tags.
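i.e. overwrite the instance config with the tags removed; the exact payload shape and the controller host/port here are from memory, so check it against a GET first:

```bash
# Fetch the current instance config, then PUT it back with an empty tag list.
# Field names are from memory - verify the shape against the GET response first.
curl "http://pinot-controller:9000/instances/Server_10.12.64.88_8098"

curl -X PUT "http://pinot-controller:9000/instances/Server_10.12.64.88_8098" \
     -H "Content-Type: application/json" \
     -d '{
           "instanceName": "Server_10.12.64.88_8098",
           "host": "10.12.64.88",
           "port": "8098",
           "type": "SERVER",
           "tags": []
         }'
```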
l
hey yea I was reading that, this takes me to a weird place tho, but this is basically what i have been doing
and @Elon that's what I did, and I see the old server with the old tags, but I still cannot delete it because of the error mentioned above
```
Failed to drop instance Server_10.12.64.100_8098 - Instance Server_10.12.64.100_8098 exists in ideal state for metrics_OFFLINE
```
```json
{
  "id": "Server_10.12.64.100_8098",
  "simpleFields": {
    "HELIX_ENABLED": "true",
    "HELIX_ENABLED_TIMESTAMP": "1646853848603",
    "HELIX_HOST": "10.12.64.88",
    "HELIX_PORT": "8098",
    "adminPort": "8097",
    "shutdownInProgress": "true"
  },
  "mapFields": {
    "SYSTEM_RESOURCE_INFO": {
      "numCores": "4",
      "totalMemoryMB": "32768",
      "maxHeapSizeMB": "13312"
    }
  },
  "listFields": {
    "TAG_LIST": [
      "OldDefaultTenant_OFFLINE",
      "OldDefaultTenant_REALTIME"
    ]
  }
}
```
this is what i see in configs in the zookeeper explorer
right now the EXTERNALVIEW definitely differs from the IDEALSTATES for the tables
looking for some help here: the dead node still can't be deleted and the UI is still blocked on the realtime table. I also have questions about the new server and the rebalance operation, as I don't see the data I would expect there
m
If the server exists in the ideal state, then that means it wasn't untagged and rebalanced correctly.
Can you check if the instance still has the tag?
l
```json
{
  "instanceName": "Server_10.12.64.88_8098",
  "hostName": "10.12.64.88",
  "enabled": true,
  "port": "8098",
  "tags": [
    "OldDefaultTenant_OFFLINE",
    "OldDefaultTenant_REALTIME"
  ],
  "pools": null,
  "grpcPort": -1,
  "adminPort": 8097,
  "systemResourceInfo": {
    "numCores": "4",
    "totalMemoryMB": "32768",
    "maxHeapSizeMB": "13312"
  }
}
```
the tag is not the same as the default one, I updated it
so it’s OldDefaultTenant now
@Jackie
j
Can you try rebalancing with downtime on? Some segments might be in ERROR or OFFLINE state on the server to be removed
You can verify that by comparing the IDEAL_STATE and EXTERNAL_VIEW of this table
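You can also pull both from the controller REST API and diff them; the controller host/port and table name here are placeholders:

```bash
# Compare IDEAL_STATE vs EXTERNAL_VIEW for the table.
# Segments whose state differs (e.g. ONLINE in one, ERROR/OFFLINE in the other)
# are what keep a no-downtime rebalance from converging.
curl -s "http://pinot-controller:9000/tables/metrics/idealstate"   > idealstate.json
curl -s "http://pinot-controller:9000/tables/metrics/externalview" > externalview.json
diff <(python -m json.tool idealstate.json) <(python -m json.tool externalview.json)
```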
l
in the zookeeper browser, EXTERNALVIEW and IDEALSTATES yes?
everything looks okay, no ERROR or OFFLINE segments
(also this is a hybrid table so i checked both OFFLINE and REALTIME)
do these options make sense to you? dryRun=false, reassignInstances=true, includeConsuming=false, bootstrap=true, downtime=true, minAvailableReplicas=true, bestEfforts=false
j
Yes, it should work
To be safe, you may try dryRun mode first and see if the assignment is expected
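Something like this, just flipping dryRun on the same endpoint (controller host/port and table name are placeholders):

```bash
# Dry run: returns the proposed segment assignment without moving anything
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=OFFLINE\
&dryRun=true&reassignInstances=true&bootstrap=true&downtime=true"

# If the proposed assignment looks right, run it for real
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=OFFLINE\
&dryRun=false&reassignInstances=true&bootstrap=true&downtime=true"
```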
l
yea assignment looks good to me, let me try to run this
and will do in both REALTIME and OFFLINE
this seems really promising so far
[screenshot: segment and document counts rising on the new server]
i can see segment count and documents count going up in the newer server
spike in p99 but nothing bad
[screenshot: p99 response times]
y axis is ms
(those are response times)
j
For the realtime table, in order to immediately remove all the segments from the old server, you should also set includeConsuming to true
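e.g. (controller host/port and table name are placeholders):

```bash
# Realtime side: includeConsuming=true so consuming segments are also moved
# off the old server immediately
curl -X POST "http://pinot-controller:9000/tables/metrics/rebalance?type=REALTIME\
&dryRun=false&reassignInstances=true&includeConsuming=true&bootstrap=true&downtime=true"
```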
l
```
Finished rebalancing table: metrics_OFFLINE with downtime in 63ms.
```
I did find this exception lurking; I think it's from before this, but do you have any clue why it may pop up?
```
Caught exception while waiting for ExternalView to converge for table: metrics_REALTIME, aborting the rebalance
java.util.concurrent.TimeoutException: Timeout while waiting for ExternalView to converge
        at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.waitForExternalViewToConverge(TableRebalancer.java:523) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.rebalance(TableRebalancer.java:308) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.rebalanceTable(PinotHelixResourceManager.java:2793) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at org.apache.pinot.controller.api.resources.PinotTableRestletResource.lambda$rebalance$3(PinotTableRestletResource.java:601) ~[pinot-all-0.10.0-SNAPSHOT-jar-with-dependencies.jar:0.10.0-SNAPSHOT-b7c181a77289fccb10cea139a097efb5d82f634a]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
```
omg this is great
[screenshot: segment and document counts matching on both servers]
same segments same documents :’)
i can finally access the Tables UI again
ok now i’m gonna get rid of the old instance with the delete instance endpoint
i’m tearing up a little
```json
{
  "status": "Successfully dropped instance"
}
```
I guess I have 2 questions: why couldn't the rebalance work with no downtime, and what's that weird exception?