# troubleshooting
g
Hi folks, I’m trying to upgrade from release-0.10.0 to release-0.11.0 and I’m greeted with the following errors:
2022/09/23 20:53:11.493 WARN [ClientCnxn] [main-SendThread(localhost:2185)] Session 0x3e3d835c1a2346f7 for server localhost/127.0.0.1:2185, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:233) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356) ~[?:?]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
2022/09/23 20:53:11.594 INFO [ZkClient] [main-EventThread] zkclient 5, zookeeper state changed ( Disconnected )
2022/09/23 20:53:12.441 INFO [ControllerResponseFilter] [grizzly-http-server-22] Handled request from 127.0.0.1 GET <http://localhost:9000/health>, content-type null status code 503 Service Unavailable
What could be the reason for this?
I also see:
2022/09/23 20:50:47.431 INFO [CallbackHandler] [ZkClient-EventThread-98-localhost:2185] Callbackhandler org.apache.helix.manager.zk.CallbackHandler@e414c0f with path /PinotCluster/LIVEINSTANCES is in reset state. Stop subscription to ZK client to avoid leaking
2022/09/23 20:50:47.431 ERROR [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c341ab48_DEFAULT)] Exception while executing DEFAULT pipeline for cluster PinotCluster. Will not continue to next pipeline
org.apache.helix.zookeeper.zkclient.exception.ZkSessionMismatchedException: Failed to get expected zookeeper instance! There is a session id mismatch. Expected: 3e3d835c1a2344e3. Actual: 3e3d835c1a2346f7
        at org.apache.helix.zookeeper.zkclient.ZkClient.getExpectedZookeeper(ZkClient.java:2746) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.lambda$doAsyncCreate$10(ZkClient.java:2257) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.doAsyncCreate(ZkClient.java:2256) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.asyncCreate(ZkClient.java:2250) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.create(ZkBaseDataAccessor.java:784) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.createChildren(ZkBaseDataAccessor.java:884) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.createChildren(ZkBaseDataAccessor.java:858) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.createChildren(ZKHelixDataAccessor.java:519) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.MessageDispatchStage.sendMessages(MessageDispatchStage.java:187) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.MessageDispatchStage.processEvent(MessageDispatchStage.java:96) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.resource.ResourceMessageDispatchStage.process(ResourceMessageDispatchStage.java:33) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController.access$500(GenericHelixController.java:132) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
2022/09/23 20:50:47.432 INFO [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c341ab48_DEFAULT)] END: Invoking DEFAULT controller pipeline for event ResourceConfigChange::c341ab48_DEFAULT for cluster PinotCluster, took 217 ms
2022/09/23 20:50:47.432 INFO [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c341ab48_DEFAULT)] Callback time for event: ResourceConfigChange took: 23 ms
x
I think these exceptions can be ignored. cc: @Subbu Subramaniam @Jack to confirm
g
It prevents the controller from starting
actually, it does start, but the health endpoint tells me:
Pinot controller status is BAD
s
@Jack should have a better answer. I don't think we have seen this in our environment, so it may be specific to your setup.
g
querying the existing table tells me:
[
  {
    "message": "BrokerResourceMissingError",
    "errorCode": 410
  }
]
All segments are marked as BAD
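For reference, error code 410 in a Pinot query response is `BrokerResourceMissingError`: the broker has no routing entry for the table in the broker resource. A small sketch (the helper name is hypothetical) that detects it in a broker response:

```python
import json

# Pinot's error code for a missing broker routing entry
BROKER_RESOURCE_MISSING_ERROR = 410

def has_broker_resource_error(response_text: str) -> bool:
    """Return True if the broker response contains a BrokerResourceMissingError."""
    exceptions = json.loads(response_text)
    return any(e.get("errorCode") == BROKER_RESOURCE_MISSING_ERROR for e in exceptions)

# The exact response pasted in this thread
resp = '[{"message": "BrokerResourceMissingError", "errorCode": 410}]'
print(has_broker_resource_error(resp))  # → True
```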
j
I don’t think the ZK exception will cause any issue. It just tries to check and create a new session if the existing one is stale
g
On the broker side I see a warning message:
2022/09/23 20:38:09.561 WARN [ZkClient] [Start a Pinot [BROKER]] zkclient 6, Failed to delete path /PinotCluster/INSTANCES/Broker_fpp15a-rb32-36a.fpp.company.com_8000/CURRENTSTATES/2948835c1a9b313b! 
org.apache.helix.zookeeper.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /PinotCluster/INSTANCES/Broker_fpp15a-rb32-36a.fpp.company.com_8000/CURRENTSTATES/2948835c1a9b313b
        at org.apache.helix.zookeeper.zkclient.exception.ZkException.create(ZkException.java:72) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:2000) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.delete(ZkClient.java:2058) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.remove(ZkBaseDataAccessor.java:727) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.removeProperty(ZKHelixDataAccessor.java:389) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ParticipantManager.carryOverPreviousCurrentState(ParticipantManager.java:461) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:162) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:1445) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:1392) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:782) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:819) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.broker.broker.helix.BaseBrokerStarter.start(BaseBrokerStarter.java:347) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.service.PinotServiceManager.startBroker(PinotServiceManager.java:143) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.service.PinotServiceManager.startRole(PinotServiceManager.java:92) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.lambda$run$0(StartServiceManagerCommand.java:278) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.admin.command.StartServiceManagerCommand.startPinotService(StartServiceManagerCommand.java:304) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.pinot.tools.admin.command.StartServiceManagerCommand$1.run(StartServiceManagerCommand.java:278) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
Caused by: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /PinotCluster/INSTANCES/Broker_fpp15a-rb32-36a.fpp.company.com_8000/CURRENTSTATES/2948835c1a9b313b
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:132) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:1793) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkConnection.delete(ZkConnection.java:144) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient$10.call(ZkClient.java:2062) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
I just created a new table (exactly the same as the existing one, but with a different name), and it is also marked as BAD
Something is going on with the ideal state vs external view states. New table is not consuming anything.
x
hmm, any server logs showing an error for segment consumption?
g
Not seeing any errors (besides the ZK ones), but I do see this:
2022/09/23 20:38:12.199 INFO [BaseServerStarter] [Start a Pinot [SERVER]] Sleep for 10000ms as service status has not turned GOOD: PinotServiceManagerStatusCallback:Started;MultipleCallbackServiceStatusCallback:IdealStateAndCurrentStateMatchServiceStatusCallback:partition=events__0__0__20220922T2347Z, expected=ONLINE, found=OFFLINE, creationTime=1663965482827, modifiedTime=1663965482827, version=0, waitingFor=CurrentStateMatch, resource=events_REALTIME, numResourcesLeft=1, numTotalResources=1, minStartCount=1,;IdealStateAndExternalViewMatchServiceStatusCallback:Init;;
2022/09/23 20:38:22.211 INFO [BaseServerStarter] [Start a Pinot [SERVER]] Sleep for 10000ms as service status has not turned GOOD: PinotServiceManagerStatusCallback:Started;MultipleCallbackServiceStatusCallback:IdealStateAndCurrentStateMatchServiceStatusCallback:partition=events__0__0__20220922T2347Z, expected=ONLINE, found=OFFLINE, creationTime=1663965482827, modifiedTime=1663965482827, version=0, waitingFor=CurrentStateMatch, resource=events_REALTIME, numResourcesLeft=1, numTotalResources=1, minStartCount=1,;IdealStateAndExternalViewMatchServiceStatusCallback:Init;;
2022/09/23 20:38:32.223 INFO [BaseServerStarter] [Start a Pinot [SERVER]] Sleep for 10000ms as service status has not turned GOOD: PinotServiceManagerStatusCallback:Started;MultipleCallbackServiceStatusCallback:IdealStateAndCurrentStateMatchServiceStatusCallback:partition=events__0__0__20220922T2347Z, expected=ONLINE, found=OFFLINE, creationTime=1663965482827, modifiedTime=1663965482827, version=0, waitingFor=CurrentStateMatch, resource=events_REALTIME, numResourcesLeft=1, numTotalResources=1, minStartCount=1,;IdealStateAndExternalViewMatchServiceStatusCallback:Init;;
which I think is for the existing table, not the new one
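The `IdealStateAndCurrentStateMatchServiceStatusCallback` string above packs the mismatch details into `key=value` pairs. A throwaway parser (assuming only the comma-separated `key=value` shape seen in these logs) makes it easier to spot which partition is stuck:

```python
import re

def parse_status(line: str) -> dict:
    """Extract key=value pairs (partition, expected, found, resource, ...)
    from a Pinot service-status log line."""
    return dict(re.findall(r"(\w+)=([\w.\-:]+)", line))

# Fragment of the status string pasted above
line = ("partition=events__0__0__20220922T2347Z, expected=ONLINE, found=OFFLINE, "
        "creationTime=1663965482827, resource=events_REALTIME, numResourcesLeft=1")
info = parse_status(line)
print(info["partition"], info["expected"], "->", info["found"])
```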
restarting the server shows me this again:
2022/09/23 21:31:08.445 WARN [ZkClient] [Start a Pinot [SERVER]] zkclient 3, Failed to delete path /PinotCluster/INSTANCES/Server_fdd5a-rb32-36a.fdd.company.com_7000/CURRENTSTATES/3e3d835c1a233ea5! 
org.apache.helix.zookeeper.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /PinotCluster/INSTANCES/Server_sjc15a-rb32-36a.sjc.dropbox.com_7000/CURRENTSTATES/3e3d835c1a233ea5
        at org.apache.helix.zookeeper.zkclient.exception.ZkException.create(ZkException.java:72) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:2000) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.delete(ZkClient.java:2058) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.remove(ZkBaseDataAccessor.java:727) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
alright, I reverted back to 0.10.0. No idea what is going on here. No obvious errors. The table started consuming from Kafka without issue; on 0.11.0, it does not
j
@Gerrit van Doorn did you deploy controllers first, and then brokers, and then servers?
g
Yes
j
when did you start seeing issues? during pinot-server deployment?
g
My deployment starts a controller, broker, and server. After they started, the existing table stopped consuming, and its segments were marked as BAD. Something about ideal state vs external view
j
I think you need to deploy all the pinot-controllers first, then all the brokers, then all the servers
if you start with only 1 controller on the newer version, it may not become the leader of your Pinot cluster, so the whole cluster is still maintained by an old-version controller
g
After that I downed the broker and server, removed them from Pinot. Brought them back, and added a table, no luck either
j
and the old controller does not know the new protocol to interact with the new brokers and servers
g
It's a single controller
j
can you paste the error message from adding a table while the new version of the controller was running?
> After that I downed the broker and server
the best practice is to down the server first and then the broker
> Brought them back, and added a table, no luck either
make sure they have the correct helix tag before adding the table
g
so there was no obvious error. The segments were marked BAD
what correct helix tag do you mean?
j
how many controllers do you have? only 1? and how many brokers and servers?
g
It’s currently a single machine with 1 controller, 1 broker, 1 server
j
> what correct helix tag do you mean?
I mean the tenant name for brokers and servers should be correct
g
OK, I can roll back to 0.11.0. What do you advise me to do? I can remove any existing tables; I'm not worried about the data right now
j
sure, you can just start with deploying the controller first, making sure it’s in GOOD state. Then deploy broker. Then server.
After deploying the controller, try out some controller APIs like listing tables, listing segments, etc.
After deploying the broker, try to fire several queries to broker
g
I’m pretty sure it will give me:
[
  {
    "message": "BrokerResourceMissingError",
    "errorCode": 410
  }
]
as mentioned earlier
g
I have only the controller running now. No tables, no server, no broker. /health tells me: “Pinot controller status is BAD”
Other than that the controller UI works fine
and listing tables gives me: []
j
can you check the controller log?
is there any exception shown there when listing the tables?
g
I don’t think so, but I get regular exceptions from the ZK client, so it’s difficult to tell
j
what regular exception do you refer to? can you paste it here?
g
2022/09/23 23:29:12.874 WARN [ClientCnxn] [main-SendThread(localhost:2185)] Session 0x4e15835c1b36388d for server localhost/127.0.0.1:2185, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:233) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356) ~[?:?]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
2022/09/23 23:29:12.976 INFO [ZkClient] [main-EventThread] zkclient 5, zookeeper state changed ( Disconnected )
j
can you check the content in the znode of <yourClusterName>/CONTROLLER/LEADER?
g
those repeat multiple times. And the following one just happened:
2022/09/23 23:30:05.713 INFO [AbstractDataCache] [HelixController-pipeline-task-PinotCluster-(a87072fd_TASK)] Event PinotCluster::TASK::a87072fd_TASK : 0 properties refreshed from ZK.
2022/09/23 23:30:05.713 INFO [ParticipantStateCache] [HelixController-pipeline-task-PinotCluster-(a87072fd_TASK)] Event PinotCluster::TASK::a87072fd_TASK : END: participantStateCache.refresh() for cluster PinotCluster, started at : 1663975805710, took 3 ms
2022/09/23 23:30:05.713 INFO [InstanceMessagesCache] [HelixController-pipeline-task-PinotCluster-(a87072fd_TASK)] END: updateRelayMessages(), 0 of valid relay messages in cache, took 0 ms. 
2022/09/23 23:30:05.713 ERROR [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c9e6dae4_DEFAULT)] Exception while executing DEFAULT pipeline for cluster PinotCluster. Will not continue to next pipeline
org.apache.helix.zookeeper.zkclient.exception.ZkSessionMismatchedException: Failed to get expected zookeeper instance! There is a session id mismatch. Expected: 4e15835c1b363939. Actual: 4e15835c1b363998
        at org.apache.helix.zookeeper.zkclient.ZkClient.getExpectedZookeeper(ZkClient.java:2746) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.lambda$doAsyncCreate$10(ZkClient.java:2257) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.doAsyncCreate(ZkClient.java:2256) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.zookeeper.zkclient.ZkClient.asyncCreate(ZkClient.java:2250) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.create(ZkBaseDataAccessor.java:784) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.createChildren(ZkBaseDataAccessor.java:884) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.createChildren(ZkBaseDataAccessor.java:858) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.createChildren(ZKHelixDataAccessor.java:519) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.MessageDispatchStage.sendMessages(MessageDispatchStage.java:187) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.MessageDispatchStage.processEvent(MessageDispatchStage.java:96) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.stages.resource.ResourceMessageDispatchStage.process(ResourceMessageDispatchStage.java:33) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:903) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController.access$500(GenericHelixController.java:132) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
        at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1554) [pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
2022/09/23 23:30:05.713 INFO [GenericHelixController] [HelixController-pipeline-default-PinotCluster-(c9e6dae4_DEFAULT)] END: Invoking DEFAULT controller pipeline for event ResourceConfigChange::c9e6dae4_DEFAULT for cluster PinotCluster, took 87 ms
{
  "id": "foo15a-rb32-36a.foo.foo.com_9000",
  "simpleFields": {
    "HELIX_VERSION": "1.0.4",
    "LIVE_INSTANCE": "1@foo15a-rb32-36a",
    "SESSION_ID": "4e15835c1b3639f8"
  },
  "mapFields": {},
  "listFields": {}
}
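One thing worth cross-checking at this point: the SESSION_ID stored in the LEADER znode against the session ids in the ZkSessionMismatchedException pasted above. If the leader record was written by a session that has since been replaced, Helix may be acting on a stale leader. A minimal comparison sketch (the hex values are copied from the pastes in this thread):

```python
# Session ids seen in this thread (hex strings from the logs / znode paste)
leader_session = "4e15835c1b3639f8"   # SESSION_ID in <cluster>/CONTROLLER/LEADER
expected = "4e15835c1b363939"         # "Expected" in ZkSessionMismatchedException
actual = "4e15835c1b363998"           # "Actual" in the same exception

for name, sid in [("expected", expected), ("actual", actual)]:
    match = "matches" if sid == leader_session else "does NOT match"
    print(f"{name} session {sid} {match} the LEADER znode session")
```

Here neither exception session matches the recorded leader session, which is consistent with the repeated session churn in the connection-reset logs.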
j
how about the content in the znode <yourclustername>/EXTERNALVIEW/leadControllerResource? are they all showing in MASTER state?
g
{
  "id": "leadControllerResource",
  "simpleFields": {
    "BATCH_MESSAGE_MODE": "false",
    "BUCKET_SIZE": "0",
    "DELAY_REBALANCE_ENABLED": "true",
    "IDEAL_STATE_MODE": "AUTO_REBALANCE",
    "INSTANCE_GROUP_TAG": "controller",
    "MIN_ACTIVE_REPLICAS": "0",
    "NUM_PARTITIONS": "24",
    "REBALANCE_DELAY": "300000",
    "REBALANCE_MODE": "FULL_AUTO",
    "REBALANCE_STRATEGY": "org.apache.helix.controller.rebalancer.strategy.AutoRebalanceStrategy",
    "REPLICAS": "1",
    "STATE_MODEL_DEF_REF": "MasterSlave",
    "STATE_MODEL_FACTORY_NAME": "DEFAULT"
  },
  "mapFields": {
    "leadControllerResource_0": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_1": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_10": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_11": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_12": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_13": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_14": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_15": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_16": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_17": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_18": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_19": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_2": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_20": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_21": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_22": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_23": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_3": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_4": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_5": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_6": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_7": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_8": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    },
    "leadControllerResource_9": {
      "Controller_foo15a-rb32-36a.foo.foo.com_9000": "OFFLINE"
    }
  },
  "listFields": {}
}
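Jack's question can be answered mechanically: walk `mapFields` and list every partition whose replica is not in MASTER state. A sketch, run here against an abbreviated two-partition version of the JSON pasted above (the instance name is shortened):

```python
import json

def non_master_partitions(external_view: dict) -> list:
    """Return (partition, instance, state) tuples for replicas not in MASTER."""
    bad = []
    for partition, replicas in external_view["mapFields"].items():
        for instance, state in replicas.items():
            if state != "MASTER":
                bad.append((partition, instance, state))
    return bad

ev = json.loads("""{
  "id": "leadControllerResource",
  "mapFields": {
    "leadControllerResource_0": {"Controller_host_9000": "OFFLINE"},
    "leadControllerResource_1": {"Controller_host_9000": "OFFLINE"}
  }
}""")
for partition, instance, state in non_master_partitions(ev):
    print(f"{partition}: {instance} is {state}")
```

Against the full paste above, all 24 partitions would be reported as OFFLINE.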
j
right, they should have been in MASTER state
g
instead of?
j
instead of “OFFLINE”. can you check the controller log by searching for those names, like leadControllerResource_9?
g
last message I see is:
2022/09/23 23:36:48.523 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(289afc88_DEFAULT)] Event 289afc88_DEFAULT : Sending Message d6033d5d-d662-46c5-a7d1-b688a6cf58c2 to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_9|[] from:OFFLINE to:SLAVE, relayMessages: 0
j
you should have seen some messages saying from OFFLINE to MASTER, or from SLAVE to MASTER
g
before that I see:
2022/09/23 23:29:04.378 INFO [AutoRebalanceStrategy] [HelixController-pipeline-default-PinotCluster-(36120192_DEFAULT)] orphan = [leadControllerResource_0|0, leadControllerResource_10|0, leadControllerResource_11|0, leadControllerResource_12|0, leadControllerResource_13|0, leadControllerResource_14|0, leadControllerResource_15|0, leadControllerResource_16|0, leadControllerResource_17|0, leadControllerResource_18|0, leadControllerResource_19|0, leadControllerResource_1|0, leadControllerResource_20|0, leadControllerResource_21|0, leadControllerResource_22|0, leadControllerResource_23|0, leadControllerResource_2|0, leadControllerResource_3|0, leadControllerResource_4|0, leadControllerResource_5|0, leadControllerResource_6|0, leadControllerResource_7|0, leadControllerResource_8|0, leadControllerResource_9|0]
no mention of MASTER in the whole log
j
how about from OFFLINE to SLAVE?
g
well that’s what I just pasted
E to:SLAVE, relayMessages: 0
2022/09/23 23:40:02.578 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(8eb21edd_DEFAULT)] Event 8eb21edd_DEFAULT : Sending Message 17908ae8-9680-4de9-ad10-e62addcbd2e5 to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_6|[] from:OFFLINE to:SLAVE, relayMessages: 0
2022/09/23 23:40:02.578 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(8eb21edd_DEFAULT)] Event 8eb21edd_DEFAULT : Sending Message c9ad8be1-09d6-4675-9cbf-40fce1a22740 to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_7|[] from:OFFLINE to:SLAVE, relayMessages: 0
2022/09/23 23:40:02.578 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(8eb21edd_DEFAULT)] Event 8eb21edd_DEFAULT : Sending Message 227889e7-54f5-462a-a112-b0b6258a5c41 to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_8|[] from:OFFLINE to:SLAVE, relayMessages: 0
2022/09/23 23:40:02.578 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(8eb21edd_DEFAULT)] Event 8eb21edd_DEFAULT : Sending Message 5fa36192-d3e4-4493-9b9c-101d1d81c88f to Controller_foo15a-rb32-36a.foo.foo.com_9000 transit leadControllerResource.leadControllerResource_9|[] from:OFFLINE to:SLAVE, relayMessages: 0
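For context, Helix's MasterSlave state model promotes a replica in two hops, OFFLINE → SLAVE → MASTER; the logs above show only the first hop being dispatched, never the second. A sketch of the expected sequence (the transition path is hard-coded here as an illustration; the real definition lives in Helix's MasterSlave state model):

```python
# Upward transition order in Helix's MasterSlave state model
MASTER_SLAVE_PATH = ["OFFLINE", "SLAVE", "MASTER"]

def transitions_to_master(current: str) -> list:
    """List the (from, to) transitions still needed to reach MASTER."""
    i = MASTER_SLAVE_PATH.index(current)
    return list(zip(MASTER_SLAVE_PATH[i:], MASTER_SLAVE_PATH[i + 1:]))

print(transitions_to_master("OFFLINE"))  # → [('OFFLINE', 'SLAVE'), ('SLAVE', 'MASTER')]
```

So after the `from:OFFLINE to:SLAVE` messages above, a healthy controller would next dispatch `from:SLAVE to:MASTER` for at least one partition.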
j
any exception around those messages?
g
no
j
could you also paste the content of the znode <yourcluster>/CONFIGS/RESOURCE/leadControllerResource?
g
{
  "id": "leadControllerResource",
  "simpleFields": {
    "RESOURCE_ENABLED": "true"
  },
  "mapFields": {},
  "listFields": {}
}
j
one last thing I might want you to do is to change the rebalance strategy to something like the below in the ideal state znode of `leadControllerResource`:
"REBALANCE_STRATEGY" : "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy",
and restart the controller with 0.11.0 again
g
I changed it in the external view. should it be in <yourcluster>/CONFIGS/RESOURCE/leadControllerResource?
j
Nah, it should be just the ideal state, not the resource config
g
ok restarted
seeing the same messages:
2022/09/23 23:51:44.329 INFO [MessageDispatchStage] [HelixController-pipeline-default-PinotCluster-(c217e5a8_DEFAULT)] Event c217e5a8_DEFAULT : Sending Message aeff7905-10b3-41f6-86ce-293fc532fd35 to Controller_sjc15a-rb32-36a.sjc.dropbox.com_9000 transit leadControllerResource.leadControllerResource_9|[] from:OFFLINE to:SLAVE, relayMessages: 0
j
Have you changed the one in idealstate as well as the external view?
g
they are the same now
when is a controller health considered “GOOD”?
j
Have you restarted controller after making the Znode update?
g
yes
s
@Jack does the zookeeper version matter? @Gerrit van Doorn what version of zookeeper are you using?
g
Good question, I was wondering the same thing. Is there a difference between 0.10 and 0.11? We have a process on top of etcd. I'll have to reach out to see what API version it supports
The team responsible for that service says that it supports ZK 3.6
Is there a minimum version specified anywhere?