
Elon

03/03/2021, 6:53 PM
Hi, we have an issue where the Pinot servers are in a crash loop and cannot start up. The servers are spewing tons of messages like:
[HelixTaskExecutor] [ZkClient-EventThread-23-pinot-us-central1-zookeeper:2181] SessionId does NOT match. expected sessionId: 300000c69e5009a, tgtSessionId in message: 300000c69e50099, messageId: 9d191304-00cc-4138-bb57-7997a960fab0
When I look in the errors section of the zookeeper browser I see:
"id": "300000c69e50084__enriched_customer_orders_jp_upsert_realtime_streaming_v1_REALTIME",
  "simpleFields": {},
  "mapFields": {
    "HELIX_ERROR     20210303-100525.000929 STATE_TRANSITION 7f8da719-5667-4d33-adb9-76a8010c9c56": {
      "AdditionalInfo": "Exception while executing a state transition task enriched_customer_orders_jp_upsert_realtime_streaming_v1__7__330__20210224T2322Zjava.lang.reflect.InvocationTargetException\n\tat jdk.internal.reflect.GeneratedMethodAccessor452.invoke(Unknown Source)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404)\n\tat org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331)\n\tat org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)\n\tat org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: java.util.NoSuchElementException: 'segment.total.docs' doesn't map to an existing object\n\tat org.apache.commons.configuration.AbstractConfiguration.getInt(AbstractConfiguration.java:816)\n\tat org.apache.pinot.core.segment.index.metadata.SegmentMetadataImpl.<init>(SegmentMetadataImpl.java:128)\n\tat org.apache.pinot.core.segment.index.loader.SegmentPreProcessor.<init>(SegmentPreProcessor.java:71)\n\tat org.apache.pinot.core.indexsegment.immutable.ImmutableSegmentLoader.load(ImmutableSegmentLoader.java:98)\n\tat org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:283)\n\tat org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:133)\n\tat org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:164)\n\t... 11 more\n",
      "Class": "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",
      "MSG_ID": "8237ad10-da30-4ad9-8b80-930b437d48fa",
      "Message state": "READ"
    },
    "HELIX_ERROR     20210303-100532.000104 STATE_TRANSITION 24244f32-ca45-463b-9c15-586d5a667669": {
      "AdditionalInfo": "Message execution failed. msgId: 8237ad10-da30-4ad9-8b80-930b437d48fa, errorMsg: java.lang.reflect.InvocationTargetException",
      "Class": "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",
      "MSG_ID": "8237ad10-da30-4ad9-8b80-930b437d48fa",
      "Message state": "READ"
    }

Jackie

03/03/2021, 8:39 PM
Based on the error message, it seems the segment enriched_customer_orders_jp_upsert_realtime_streaming_v1__7__330__20210224T2322Z is corrupted.
Does this happen to only one server or all servers?
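
For anyone hitting the same `'segment.total.docs' doesn't map to an existing object` error from the stack trace above: a minimal sketch, in Python, of checking whether a segment's metadata file still carries that key. It assumes the metadata is a plain `key = value` properties file (commonly `metadata.properties` inside the segment directory; the exact path depends on the segment format and your data directory layout). The function names are hypothetical helpers, not part of Pinot.

```python
REQUIRED_KEY = "segment.total.docs"  # key named in the NoSuchElementException


def parse_properties(text):
    """Parse simple `key = value` lines, skipping blanks and #/! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


def has_valid_total_docs(text):
    """Return True if segment.total.docs is present and a non-negative integer."""
    value = parse_properties(text).get(REQUIRED_KEY)
    return value is not None and value.isdigit()
```

To use it, read the metadata file of the suspect segment on the server's local disk and pass its text to `has_valid_total_docs`. An empty or truncated file would return `False`, which is consistent with the load failure shown in the trace, though it does not by itself say how the file got into that state.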

Elon

03/03/2021, 8:41 PM
only the tenants where it exists.

Jackie

03/03/2021, 8:41 PM
If you have time, we can have a quick Zoom chat to debug the issue.

Elon

03/03/2021, 8:41 PM
wow, I owe you one:) Sure whenever you have some time.

Shubham Kumar

04/04/2023, 12:57 PM
Hi @Jackie @Elon, we are also running into similar issues while restarting our Pinot cluster and have not been able to resolve them. Can you recall what the solution for this issue was?

Elon

04/04/2023, 2:27 PM
Pinot has improved by leaps and bounds since then:) What is the error you’re getting?
What version of Pinot are you using? Which components do not start?

Shubham Kumar

04/04/2023, 6:17 PM
We are using version 0.11.0. Currently the Pinot servers are not starting, with this error:
2023/04/04 18:04:01.828 WARN [HelixTaskExecutor] [Start a Pinot [SERVER]] SessionId does NOT match. expected sessionId: 10000c327fc00ca, tgtSessionId in message: 10000c327fc00b7, messageId: d18c1dbc-d573-46b0-b97f-66e0131b31c6
2023/04/04 18:04:01.831 WARN [HelixTaskExecutor] [Start a Pinot [SERVER]] SessionId does NOT match. expected sessionId: 10000c327fc00ca, tgtSessionId in message: 10000c327fc00b7, messageId: fc9d6eeb-4c38-4e98-9f92-8a57be47b67c
2023/04/04 18:04:01.834 WARN [HelixTaskExecutor] [Start a Pinot [SERVER]] SessionId does NOT match. expected sessionId: 10000c327fc00ca, tgtSessionId in message: 10000c327fc00b7, messageId: b82e4eca-14e3-4bc5-a07a-f3ec079d82e3

2023/04/04 18:17:18.502 WARN [ZkClient] [Start a Pinot [SERVER]] zkclient 3, Failed to delete path /pinot-nonprod/INSTANCES/Server_pinot2-server-0.pinot2-server-headless.np.svc.cluster.local_8098/CURRENTSTATES/20001a0c21e00a9! 
Apr 4, 2023 @ 23:47:18.507	at org.apache.helix.zookeeper.zkclient.ZkClient.delete(ZkClient.java:2058) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
Apr 4, 2023 @ 23:47:18.507	at org.apache.helix.manager.zk.ZkBaseDataAccessor.remove(ZkBaseDataAccessor.java:727) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
Apr 4, 2023 @ 23:47:18.507	at org.apache.helix.manager.zk.ParticipantManager.handleNewSession(ParticipantManager.java:162) ~[pinot-all-0.11.0-jar-with-dependencies.jar:0.11.0-1b4d6b6b0a27422c1552ea1a936ad145056f7033]
On the controller host, we are getting this error related to the connection with the servers:
2023/04/04 18:12:07.248 WARN [MultiHttpRequest] [async-task-thread-125] Caught 'java.net.UnknownHostException: pinot2-server-0.pinot2-server-headless.np.svc.cluster.local' while executing: GET on URL: <http://pinot2-server-0.pinot2-server-headless.np.svc.cluster.local:8097/table/li_ss_append_OFFLINE/size>
2023/04/04 18:16:09.201 WARN [ZKHelixManager] [prometheus-http-1-3] Instance pinot2-controller-1.pinot2-controller-headless.np.svc.cluster.local_9000 is not leader of cluster pinot-nonprod due to current session 10000c327fc0013 does not match leader session 10000c327fc0016
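
When a server emits thousands of `SessionId does NOT match` warnings like the ones pasted above, it can help to collapse them into counts per (expected, target) session pair to see how many stale sessions the pending messages were addressed to. Below is a minimal Python sketch of that; the regular expression is written against the log format shown in this thread, and `summarize_mismatches` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the HelixTaskExecutor warning format shown in the logs above.
PATTERN = re.compile(
    r"SessionId does NOT match\. expected sessionId: (?P<expected>\w+), "
    r"tgtSessionId in message: (?P<target>\w+), messageId: (?P<msg>[0-9a-f-]+)"
)


def summarize_mismatches(lines):
    """Count warnings per (expected session, target session) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("expected"), match.group("target"))] += 1
    return counts
```

Feeding it the server log line by line (for example `summarize_mismatches(open(path))`, where `path` is your log file) gives one entry per distinct pair. A single pair with a large count would suggest a backlog of messages addressed to one earlier session of the same instance, rather than several unrelated sessions.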