# troubleshooting
g
Hi folks, I’m trying to upgrade from release-0.10.0 to release-0.11.0 and I’m greeted with the following errors:
2022/09/23 20:53:11.493 WARN [ClientCnxn] [main-SendThread(localhost:2185)] Session 0x3e3d835c1a2346f7 for server localhost/127.0.0.1:2185, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:233) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356) ~[?:?]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
2022/09/23 20:53:11.594 INFO [ZkClient] [main-EventThread] zkclient 5, zookeeper state changed ( Disconnected )
2022/09/23 20:53:12.441 INFO [ControllerResponseFilter] [grizzly-http-server-22] Handled request from 127.0.0.1 GET <http://localhost:9000/health>, content-type null status code 503 Service Unavailable
What could be the reason for this?
I also see:
2022/09/23 20:50:47.431 INFO [CallbackHandler] [ZkClient-EventThread-98-localhost:2185] Callbackhandler org.apache.helix.manager.zk.CallbackHandler@e414c0f with path /PinotCluster/LIVEINSTANCES is in reset state. Stop subscription to ZK client to avoid leaking
2022/09/23 20:50:47.431 ERROR [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c341ab48_DEFAULT)] Exception while executing DEFAULT pipeline for cluster PinotCluster. Will not continue to next pipeline
org.apache.helix.zookeeper.zkclient.exception.ZkSessionMismatchedException: Failed to get expected zookeeper instance! There is a session id mismatch. Expected: 3e3d835c1a2344e3. Actual: 3e3d835c1a2346f7
        at org.apache.helix.zookeeper.zkclient.ZkClient.getExpectedZookeeper(ZkClient.java:2746) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.lambda$doAsyncCreate$10(ZkClient.java:2257) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.doAsyncCreate(ZkClient.java:2256) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.asyncCreate(ZkClient.java:2250) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.create(ZkBaseDataAccessor.java:784) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.createChildren(ZkBaseDataAccessor.java:884) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.createChildren(ZkBaseDataAccessor.java:858) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.createChildren(ZKHelixDataAccessor.java:519) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.MessageDispatchStage.sendMessages(MessageDispatchStage.java:187) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.MessageDispatchStage.processEvent(MessageDispatchStage.java:96) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.resource.ResourceMessageDispatchStage.process(ResourceMessageDispatchStage.java:33) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController.access$500(GenericHelixController.java:132) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
2022/09/23 20:50:47.432 INFO [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c341ab48_DEFAULT)] END: Invoking DEFAULT controller pipeline for event ResourceConfigChange::c341ab48_DEFAULT for cluster PinotCluster, took 217 ms
2022/09/23 20:50:47.432 INFO [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c341ab48_DEFAULT)] Callback time for event: ResourceConfigChange took: 23 ms
x
I think these exceptions can be ignored. cc: @Subbu Subramaniam @Jack to confirm
g
It prevents the controller from starting
actually, it does start, but the health endpoint tells me:
Pinot controller status is BAD
s
@Jack should have a better answer. I don't think we have seen this in our environment, so it may be specific to your setup.
g
querying the existing table tells me:
[
  {
    "message": "BrokerResourceMissingError",
    "errorCode": 410
  }
]
All segments are marked as BAD
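For reference, error code 410 in a Pinot query response is `BrokerResourceMissingError`: the broker has no routing entry for the table in the broker resource. A small sketch (the helper name is hypothetical) that detects it in a broker response:

```python
import json

# Pinot's error code for a missing broker routing entry
BROKER_RESOURCE_MISSING_ERROR = 410

def has_broker_resource_error(response_text: str) -> bool:
    """Return True if the broker response contains a BrokerResourceMissingError."""
    exceptions = json.loads(response_text)
    return any(e.get("errorCode") == BROKER_RESOURCE_MISSING_ERROR for e in exceptions)

# The exact response pasted in this thread
resp = '[{"message": "BrokerResourceMissingError", "errorCode": 410}]'
print(has_broker_resource_error(resp))  # → True
```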
j
I don’t think the ZK exception will cause any issue. It just tries to check and create a new session if the existing one is stale
g
On the broker side I see a warning message:
2022/09/23 20:38:09.561 WARN [ZkClient] [Start a Pinot [BROKER]] zkclient 6, Failed to delete path /PinotCluster/INSTANCES/Broker_fpp15a-rb32-36a.fpp.company.com_8000/CURRENTSTATES/2948835c1a9b313b! 
org.apache.helix.zookeeper.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /PinotCluster/INSTANCES/Broker_fpp15a-rb32-36a.fpp.company.com_8000/CURRENTSTATES/2948835c1a9b313b
        at org.apache.helix.zookeeper.zkclient.exception.ZkException.create(ZkException.java:72) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:2000) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.delete(ZkClient.java:2058) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.remove(ZkBaseDataAccessor.java:727) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.removeProperty(ZKHelixDataAccessor.java:389) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ParticipantManager.carryOverPreviousCurrentState(ParticipantManager.java:461) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:162) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:1445) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:1392) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:782) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:819) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.broker.broker.helix.BaseBrokerStarter.start(BaseBrokerStarter.java:347) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.service.PinotServiceManager.startBroker(PinotServiceManager.java:143) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:92) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:278) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:304) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:278) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /PinotCluster/INSTANCES/Broker_fpp15a-rb32-36a.fpp.company.com_8000/CURRENTSTATES/2948835c1a9b313b
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:132) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:1793) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkConnection.delete(ZkConnection.java:144) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient$10.call(ZkClient.java:2062) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
I just created a new table (exactly the same as the existing one, but with a different name), and it is also marked as BAD
Something is going on with the ideal state vs external view states. New table is not consuming anything.
x
hmm, any server logs showing an error for segment consumption?
g
Not seeing any errors (besides the ZK ones), but I do see this:
2022/09/23 20:38:12.199 INFO [BaseServerStarter] [Start a Pinot [SERVER]] Sleep for 10000ms as service status has not turned GOOD: PinotServiceManagerStatusCallback:Started;MultipleCallbackServiceStatusCallback:IdealStateAndCurrentStateMatchServiceStatusCallback:partition=events__0__0__20220922T2347Z, expected=ONLINE, found=OFFLINE, creationTime=1663965482827, modifiedTime=1663965482827, version=0, waitingFor=CurrentStateMatch, resource=events_REALTIME, numResourcesLeft=1, numTotalResources=1, minStartCount=1,;IdealStateAndExternalViewMatchServiceStatusCallback:Init;;
2022/09/23 20:38:22.211 INFO [BaseServerStarter] [Start a Pinot [SERVER]] Sleep for 10000ms as service status has not turned GOOD: PinotServiceManagerStatusCallback:Started;MultipleCallbackServiceStatusCallback:IdealStateAndCurrentStateMatchServiceStatusCallback:partition=events__0__0__20220922T2347Z, expected=ONLINE, found=OFFLINE, creationTime=1663965482827, modifiedTime=1663965482827, version=0, waitingFor=CurrentStateMatch, resource=events_REALTIME, numResourcesLeft=1, numTotalResources=1, minStartCount=1,;IdealStateAndExternalViewMatchServiceStatusCallback:Init;;
2022/09/23 20:38:32.223 INFO [BaseServerStarter] [Start a Pinot [SERVER]] Sleep for 10000ms as service status has not turned GOOD: PinotServiceManagerStatusCallback:Started;MultipleCallbackServiceStatusCallback:IdealStateAndCurrentStateMatchServiceStatusCallback:partition=events__0__0__20220922T2347Z, expected=ONLINE, found=OFFLINE, creationTime=1663965482827, modifiedTime=1663965482827, version=0, waitingFor=CurrentStateMatch, resource=events_REALTIME, numResourcesLeft=1, numTotalResources=1, minStartCount=1,;IdealStateAndExternalViewMatchServiceStatusCallback:Init;;
which I think is for the existing table, not the new one
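The `IdealStateAndCurrentStateMatchServiceStatusCallback` string above packs the mismatch details into `key=value` pairs. A throwaway parser (assuming only the comma-separated `key=value` shape seen in these logs) makes it easier to spot which partition is stuck:

```python
import re

def parse_status(line: str) -> dict:
    """Extract key=value pairs (partition, expected, found, resource, ...)
    from a Pinot service-status log line."""
    return dict(re.findall(r"(\w+)=([\w.\-:]+)", line))

# Fragment of the status string pasted above
line = ("partition=events__0__0__20220922T2347Z, expected=ONLINE, found=OFFLINE, "
        "creationTime=1663965482827, resource=events_REALTIME, numResourcesLeft=1")
info = parse_status(line)
print(info["partition"], info["expected"], "->", info["found"])
```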
restarting the server shows me this again:
2022/09/23 21:31:08.445 WARN [ZkClient] [Start a Pinot [SERVER]] zkclient 3, Failed to delete path /PinotCluster/INSTANCES/Server_fdd5a-rb32-36a.fdd.company.com_7000/CURRENTSTATES/3e3d835c1a233ea5! 
org.apache.helix.zookeeper.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /PinotCluster/INSTANCES/Server_sjc15a-rb32-36a.sjc.dropbox.com_7000/CURRENTSTATES/3e3d835c1a233ea5
        at org.apache.helix.zookeeper.zkclient.exception.ZkException.create(ZkException.java:72) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:2000) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.delete(ZkClient.java:2058) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.remove(ZkBaseDataAccessor.java:727) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
alright, I reverted back to 0.10.0. No idea what is going on here. No obvious errors. The table started consuming from Kafka without issue; on 0.11.0, it does not
j
@Gerrit van Doorn did you deploy controllers first, and then brokers, and then servers?
g
Yes
j
when did you start seeing issues? during pinot-server deployment?
g
My deployment starts a controller, broker, and server. After they started, the existing table stopped consuming, and its segments were marked as BAD. Something about ideal state vs external view
j
I think you need to deploy all the pinot-controllers first, then all the brokers, then all the servers
if you start with only 1 controller on the newer version, it may not become the leader of your Pinot cluster, so the whole cluster is still maintained by an old-version controller
g
After that I downed the broker and server, removed them from Pinot. Brought them back, and added a table, no luck either
j
and the old controller does not know the new protocol to interact with the new brokers and servers
g
It's a single controller
j
can you paste the error message from adding a table while the new version of the controller was running?
> After that I downed the broker and server
the best practice is to down the server first and then the broker
> Brought them back, and added a table, no luck either
make sure they have the correct helix tag before adding the table
g
so there was no obvious error. The segments were marked BAD
what correct helix tag do you mean?
j
how many controllers do you have? only 1? and how many brokers and servers?
g
It’s currently a single machine with 1 controller, 1 broker, 1 server
j
> what correct helix tag do you mean?
I mean the tenant name for brokers and servers should be correct
g
OK, I can roll back to 0.11.0. What do you advise me to do? I can remove any existing tables; I'm not worried about the data right now
j
sure, you can just start with deploying the controller first, making sure it’s in GOOD state. Then deploy broker. Then server.
After deploying the controller, try out some controller APIs like listing tables, listing segments, etc.
After deploying the broker, try to fire several queries to broker
g
I’m pretty sure it will give me:
[
  {
    "message": "BrokerResourceMissingError",
    "errorCode": 410
  }
]
as mentioned earlier
g
I have only the controller running now. No tables, no server, no broker. /health tells me: “Pinot controller status is BAD”
Other than that the controller UI works fine
and listing tables gives me: []
j
can you check the controller log?
is there any exception shown there when listing the tables?
g
I don’t think so, but I get regular exceptions from the ZK client, so it’s difficult to tell
j
what regular exception do you refer to? can you paste it here?
g
2022/09/23 23:29:12.874 WARN [ClientCnxn] [main-SendThread(localhost:2185)] Session 0x4e15835c1b36388d for server localhost/127.0.0.1:2185, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:233) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356) ~[?:?]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
2022/09/23 23:29:12.976 INFO [ZkClient] [main-EventThread] zkclient 5, zookeeper state changed ( Disconnected )
j
can you check the content in the znode of <yourClusterName>/CONTROLLER/LEADER?
g
those repeat multiple times. And the following one just happened:
2022/09/23 23:30:05.713 INFO [AbstractDataCache] [HelixController-pipeline-task-PinotCluster-(a87072fd_TASK)] Event PinotCluster::TASK::a87072fd_TASK : 0 properties refreshed from ZK.
2022/09/23 23:30:05.713 INFO [ParticipantStateCache] [HelixController-pipeline-task-PinotCluster-(a87072fd_TASK)] Event PinotCluster::TASK::a87072fd_TASK : END: participantStateCache.refresh() for cluster PinotCluster, started at : 1663975805710, took 3 ms
2022/09/23 23:30:05.713 INFO [InstanceMessagesCache] [HelixController-pipeline-task-PinotCluster-(a87072fd_TASK)] END: updateRelayMessages(), 0 of valid relay messages in cache, took 0 ms. 
2022/09/23 23:30:05.713 ERROR [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c9e6dae4_DEFAULT)] Exception while executing DEFAULT pipeline for cluster PinotCluster. Will not continue to next pipeline
org.apache.helix.zookeeper.zkclient.exception.ZkSessionMismatchedException: Failed to get expected zookeeper instance! There is a session id mismatch. Expected: 4e15835c1b363939. Actual: 4e15835c1b363998
        at org.apache.helix.zookeeper.zkclient.ZkClient.getExpectedZookeeper(ZkClient.java:2746) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.lambda$doAsyncCreate$10(ZkClient.java:2257) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.doAsyncCreate(ZkClient.java:2256) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.asyncCreate(ZkClient.java:2250) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.create(ZkBaseDataAccessor.java:784) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.createChildren(ZkBaseDataAccessor.java:884) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.createChildren(ZkBaseDataAccessor.java:858) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.createChildren(ZKHelixDataAccessor.java:519) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.MessageDispatchStage.sendMessages(MessageDispatchStage.java:187) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.MessageDispatchStage.processEvent(MessageDispatchStage.java:96) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.resource.ResourceMessageDispatchStage.process(ResourceMessageDispatchStage.java:33) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController.access$500(GenericHelixController.java:132) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
2022/09/23 23:30:05.713 INFO [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c9e6dae4_DEFAULT)] END: Invoking DEFAULT controller pipeline for event ResourceConfigChange::c9e6dae4_DEFAULT for cluster PinotCluster, took 87 ms
{
  "id": "foo15a-rb32-36a.foo.foo.com_9000",
  "simpleFields": {
    "HELIX_VERSION": "1.0.4",
    "LIVE_INSTANCE": "1@foo15a-rb32-36a",
    "SESSION_ID": "4e15835c1b3639f8"
  },
  "mapFields": {},
  "listFields": {}
}
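One thing worth cross-checking at this point: the SESSION_ID stored in the LEADER znode against the session ids in the ZkSessionMismatchedException pasted above. If the leader record was written by a session that has since been replaced, Helix may be acting on a stale leader. A minimal comparison sketch (the hex values are copied from the pastes in this thread):

```python
# Session ids seen in this thread (hex strings from the logs / znode paste)
leader_session = "4e15835c1b3639f8"   # SESSION_ID in <cluster>/CONTROLLER/LEADER
expected = "4e15835c1b363939"         # "Expected" in ZkSessionMismatchedException
actual = "4e15835c1b363998"           # "Actual" in the same exception

for name, sid in [("expected", expected), ("actual", actual)]:
    match = "matches" if sid == leader_session else "does NOT match"
    print(f"{name} session {sid} {match} the LEADER znode session")
```

Here neither exception session matches the recorded leader session, which is consistent with the repeated session churn in the connection-reset logs.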
j
how about the content in the znode <yourclustername>/EXTERNALVIEW/leadControllerResource? are they all showing in MASTER state?
g
{
  "id": "leadControllerResource",
  "simpleFields": {
    "BATCH_MESSAGE_MODE": "false",
    "BUCKET_SIZE": "0",
    "DELAY_REBALANCE_ENABLED": "true",
    "IDEAL_STATE_MODE": "AUTO_REBALANCE",
    "INSTANCE_GROUP_TAG": "controller",
    "MIN_ACTIVE_REPLICAS": "0",
    "NUM_PARTITIONS": "24",
    "REBALANCE_DELAY": "300000",
    "REBALANCE_MODE": "FULL_AUTO",
    "REBALANCE_STRATEGY": "org.apache.helix.controller.rebalancer.strategy.AutoRebalanceStrategy",
    "REPLICAS": "1",
    "STATE_MODEL_DEF_REF": "MasterSlave",
    "STATE_MODEL_FACTORY_NAME": "DEFAULT"
  },
  "mapFields": {
    "leadControllerResource_0": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_1": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_10": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_11": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_12": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_13": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_14": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_15": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_16": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_17": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_18": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_19": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_2": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_20": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_21": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_22": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_23": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_3": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_4": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_5": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_6": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_7": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_8": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_9": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    }
  },
  "listFields": {}
}
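Jack's question can be answered mechanically: walk `mapFields` and list every partition whose replica is not in MASTER state. A sketch, run here against an abbreviated two-partition version of the JSON pasted above (the instance name is shortened):

```python
import json

def non_master_partitions(external_view: dict) -> list:
    """Return (partition, instance, state) tuples for replicas not in MASTER."""
    bad = []
    for partition, replicas in external_view["mapFields"].items():
        for instance, state in replicas.items():
            if state != "MASTER":
                bad.append((partition, instance, state))
    return bad

ev = json.loads("""{
  "id": "leadControllerResource",
  "mapFields": {
    "leadControllerResource_0": {"Controller_host_9000": "OFFLINE"},
    "leadControllerResource_1": {"Controller_host_9000": "OFFLINE"}
  }
}""")
for partition, instance, state in non_master_partitions(ev):
    print(f"{partition}: {instance} is {state}")
```

Against the full paste above, all 24 partitions would be reported as OFFLINE.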
j
right, they should have been in MASTER state
g
instead of?
j
instead of “OFFLINE”. can you check the controller log by searching for those names, like leadControllerResource_9?
g
last message I see is:
2022/09/23 23:36:48.523 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(289afc88_DEFAULT)] Event 289afc88_DEFAULT : Sending Message d6033d5d-d662-46c5-a7d1-b688a6cf58c2 to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_9|[] from:OFFLINE to:SLAVE, relayMessages: 0
j
you should have seen some messages saying from OFFLINE to MASTER, or from SLAVE to MASTER
g
before that I see:
2022/09/23 23:29:04.378 INFO [AutoRebalanceStrategy] [HelixController-pipeline-default-PinotCluster-(36120192_DEFAULT)] orphan = [leadControllerResource_0|0, leadControllerResource_10|0, leadControllerResource_11|0, leadControllerResource_12|0, leadControllerResource_13|0, leadControllerResource_14|0, leadControllerResource_15|0, leadControllerResource_16|0, leadControllerResource_17|0, leadControllerResource_18|0, leadControllerResource_19|0, leadControllerResource_1|0, leadControllerResource_20|0, leadControllerResource_21|0, leadControllerResource_22|0, leadControllerResource_23|0, leadControllerResource_2|0, leadControllerResource_3|0, leadControllerResource_4|0, leadControllerResource_5|0, leadControllerResource_6|0, leadControllerResource_7|0, leadControllerResource_8|0, leadControllerResource_9|0]
no mention of MASTER in the whole log
j
how about from OFFLINE to SLAVE?
g
well that’s what I just pasted
E to:SLAVE, relayMessages: 0
2022/09/23 23:40:02.578 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(8eb21edd_DEFAULT)] Event 8eb21edd_DEFAULT : Sending Message 17908ae8-9680-4de9-ad10-e62addcbd2e5 to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_6|[] from:OFFLINE to:SLAVE, relayMessages: 0
2022/09/23 23:40:02.578 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(8eb21edd_DEFAULT)] Event 8eb21edd_DEFAULT : Sending Message c9ad8be1-09d6-4675-9cbf-40fce1a22740 to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_7|[] from:OFFLINE to:SLAVE, relayMessages: 0
2022/09/23 23:40:02.578 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(8eb21edd_DEFAULT)] Event 8eb21edd_DEFAULT : Sending Message 227889e7-54f5-462a-a112-b0b6258a5c41 to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_8|[] from:OFFLINE to:SLAVE, relayMessages: 0
2022/09/23 23:40:02.578 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(8eb21edd_DEFAULT)] Event 8eb21edd_DEFAULT : Sending Message 5fa36192-d3e4-4493-9b9c-101d1d81c88f to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_9|[] from:OFFLINE to:SLAVE, relayMessages: 0
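For context, Helix's MasterSlave state model promotes a replica in two hops, OFFLINE → SLAVE → MASTER; the logs above show only the first hop being dispatched, never the second. A sketch of the expected sequence (the transition path is hard-coded here as an illustration; the real definition lives in Helix's MasterSlave state model):

```python
# Upward transition order in Helix's MasterSlave state model
MASTER_SLAVE_PATH = ["OFFLINE", "SLAVE", "MASTER"]

def transitions_to_master(current: str) -> list:
    """List the (from, to) transitions still needed to reach MASTER."""
    i = MASTER_SLAVE_PATH.index(current)
    return list(zip(MASTER_SLAVE_PATH[i:], MASTER_SLAVE_PATH[i + 1:]))

print(transitions_to_master("OFFLINE"))  # → [('OFFLINE', 'SLAVE'), ('SLAVE', 'MASTER')]
```

So after the `from:OFFLINE to:SLAVE` messages above, a healthy controller would next dispatch `from:SLAVE to:MASTER` for at least one partition.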
j
any exception around those messages?
g
no
j
could you also paste the content of the znode <yourcluster>/CONFIGS/RESOURCE/leadControllerResource?
g
{
  "id": "leadControllerResource",
  "simpleFields": {
    "RESOURCE_ENABLED": "true"
  },
  "mapFields": {},
  "listFields": {}
}
j
one last thing I might want you to do is to change the rebalance strategy to something like the below in the ideal state znode of `leadControllerResource`:
"REBALANCE_STRATEGY" : "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy",
and restart the controller with 0.11.0 again
g
I changed it in the external view. should it be in <yourcluster>/CONFIGS/RESOURCE/leadControllerResource?
j
Nah, it should be just the ideal state, not the resource config
g
ok restarted
seeing the same messages:
2022/09/23 23:51:44.329 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(c217e5a8_DEFAULT)] Event c217e5a8_DEFAULT : Sending Message aeff7905-10b3-41f6-86ce-293fc532fd35 to Controller_sjc15a-rb32-36a.sjc.dropbox.com_9000 transit leadControllerResource.leadControllerResource_9|[] from:OFFLINE to:SLAVE, relayMessages: 0
j
Have you changed the one in idealstate as well as the external view?
g
they are the same now
when is a controller health considered “GOOD”?
j
Have you restarted controller after making the Znode update?
g
yes
s
@Jack does the zookeeper version matter? @Gerrit van Doorn what version of zookeeper are you using?
g
Good question, I was wondering the same thing. Is there a difference between 0.10 and 0.11? We have a process on top of etcd. I'll have to reach out to see what API version it supports
The team responsible for that service says that it supports ZK 3.6
Is there a minimum version specified anywhere?