# troubleshooting
z
We are running Pinot in Kubernetes, and noticed that the servers are considered ready too early, before the server has actually finished starting. This causes the StatefulSet rolling restart to restart multiple servers simultaneously, making segments inaccessible. Shouldn't the server API /health endpoint be used for readiness probing?
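For reference, a minimal readiness-probe sketch against the server's /health endpoint, assuming the server admin API listens on its default port 8097; the delay and threshold values are placeholders, not values taken from the chart:

```yaml
# Hedged sketch: mark the server pod Ready only once /health on the admin API
# reports healthy, instead of relying on the container merely being up.
readinessProbe:
  httpGet:
    path: /health
    port: 8097           # assumed default admin API port
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 10   # allow extra time for segment loading
```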
m
The broker routes a query to a server only for the segments that are online on that server.
z
In our case 7 out of 8 servers were restarting at the same time
m
Are you using replica groups? If so, you could restart one replica group at a time.
z
We are not using them.
Also, we are doing Helm upgrades for config changes, so the restarts aren't done manually.
m
@Xiang Fu Any suggestions? IIRC, there are deployments that have hooks that wait for some time (x minutes) before reporting healthy? cc: @Jackie
j
Which version of Pinot are you running? How do you shut down the servers? We need to ensure the shutdown hook is called when shutting down the servers.
z
Running 0.7.1 with the Helm chart from the repo. When we do a helm upgrade (e.g. last time I configured S3 retries for the servers), the pods are restarted by the StatefulSet controller using the default RollingUpdate strategy. The controller waits for the restarted pod to be Ready, then proceeds to restart the next one. The standard Kubernetes termination is SIGTERM, followed by SIGKILL after 30s if the process hasn't exited.
In the chart the brokers have the /health readiness probe, which is why I'm wondering why the servers don't have it set.
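On the termination side, one knob worth checking is the pod's grace period, so the server's shutdown hook has time to finish before the kubelet sends SIGKILL; a hedged sketch, with 600s as an arbitrary placeholder:

```yaml
# StatefulSet pod template fragment: extend the default 30s grace period so a
# graceful server shutdown can complete after SIGTERM.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600
```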
j
Here is a fix for adding the shutdown hook for the server: https://github.com/apache/pinot/pull/7251
Seems it is not included in 0.8.0, so you need to try either the current master or wait for the next release.
Adding @Xiang Fu to take a look as well
z
Looking into the stop method, it seems that the shutdown resource check is disabled by default, while the comment suggests it should be enabled by default, since the startup check is enabled by default:
```java
// Shutdown: enable resource check before shutting down the server
//           Will wait until all the resources in the external view are neither ONLINE nor CONSUMING
//           No need to enable this check if startup service status check is enabled
public static final String CONFIG_OF_SHUTDOWN_ENABLE_RESOURCE_CHECK = "pinot.server.shutdown.enableResourceCheck";
public static final boolean DEFAULT_SHUTDOWN_ENABLE_RESOURCE_CHECK = false;
```
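If you do want both checks on explicitly, here is a hedged sketch of how that could be passed through the chart's server config. The shutdown key is taken from the constants above; the startup key is from memory and worth verifying, and whether the chart exposes a `server.extra.configs` value like this is also an assumption:

```yaml
# Hypothetical values.yaml fragment for the Pinot Helm chart.
server:
  extra:
    configs: |-
      pinot.server.startup.enableServiceStatusCheck=true
      pinot.server.shutdown.enableResourceCheck=true
```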
Oh, never mind, I read that comment wrong.
I'm looking into the docs for the replica groups, and can't find how the servers are assigned to replica groups.
j
z
From that example I can't see how I could control which servers are assigned to which replica group.
I'd like to run multiple statefulsets of servers, and assign the replica groups to different statefulsets so they can be restarted without disrupting multiple replicas of any segments
Now I see the pool based one is what I'm looking for, thanks!
m
Pool-based assignment is more for a giant multi-tenant cluster with hundreds of tables. Replica groups are a simple concept.
z
In this case, is the assignment to servers arbitrary, or can it be controlled?
Per my current understanding, two Pinot Helm deployments could be run with the same ZooKeeper and cluster name, with the servers assigned to different pools in each. Then the deployments could be upgraded separately without downtime.
Also, the docs mention a recommended upgrade sequence (when changing the Pinot version) of controller, broker, etc.; this could make a further split necessary, since I don't think Helm guarantees any order when upgrading these StatefulSets.
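For the pool question above, a hedged sketch of the table-config side of pool-based replica-group instance assignment. The field names follow the instance-assignment docs, the tenant tag and counts are placeholders, and how each server's instance config gets its pool number (which is what would tie a pool to one of the two StatefulSets) is not shown here:

```json
{
  "instanceAssignmentConfigMap": {
    "OFFLINE": {
      "tagPoolConfig": {
        "tag": "DefaultTenant_OFFLINE",
        "poolBased": true,
        "numPools": 2
      },
      "replicaGroupPartitionConfig": {
        "replicaGroupBased": true,
        "numReplicaGroups": 2,
        "numInstancesPerReplicaGroup": 3
      }
    }
  }
}
```

The idea is that each replica group is drawn from a different pool, so restarting one StatefulSet (one pool) should take down at most one replica of each segment.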