How can I repair a segment external view? The segm...
# troubleshooting
d
How can I repair a segment's external view? The segment is allocated to 2 of 3 servers and is OFFLINE on both. The file exists in deep storage and the ideal state looks fine. I've tried reset and reload, and I've also restarted the controllers and servers, but the state won't change.
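For anyone hitting the same symptom, here is a minimal sketch of how the mismatch could be pinned down by diffing the two views. It assumes both views have already been fetched and reduced to the Helix `mapFields` shape (`segment -> instance -> state`). The instance names and the second segment name are hypothetical, and the function name is just illustrative.

```python
def find_mismatched_segments(ideal_state, external_view):
    """Return {segment: {instance: (ideal, actual)}} for replicas whose states differ.

    Both arguments are assumed to be dicts shaped like Helix mapFields:
    segment name -> instance name -> state string.
    """
    mismatches = {}
    for segment, ideal_replicas in ideal_state.items():
        actual_replicas = external_view.get(segment, {})
        diffs = {
            instance: (ideal, actual_replicas.get(instance))
            for instance, ideal in ideal_replicas.items()
            if actual_replicas.get(instance) != ideal
        }
        if diffs:
            mismatches[segment] = diffs
    return mismatches


# Hypothetical data: only the first segment name comes from the thread.
ideal_state = {
    "audit_event__15__19__20220305T2036Z": {"Server_a": "ONLINE", "Server_b": "ONLINE"},
    "audit_event__15__20__20220306T0100Z": {"Server_a": "ONLINE", "Server_c": "ONLINE"},
}
external_view = {
    "audit_event__15__19__20220305T2036Z": {"Server_a": "OFFLINE", "Server_b": "OFFLINE"},
    "audit_event__15__20__20220306T0100Z": {"Server_a": "ONLINE", "Server_c": "ONLINE"},
}

mismatches = find_mismatched_segments(ideal_state, external_view)
for segment, diffs in mismatches.items():
    print(segment, diffs)
```

A replica missing from the external view shows up with `None` as its actual state, so dropped replicas are reported as well as ones stuck in the wrong state.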
n
Any exceptions in the server logs while restarting, related to that segment?
d
There is this error in the controller logs
{
  "id": "1000859de3b0088__audit_event_REALTIME",
  "simpleFields": {},
  "mapFields": {
    "HELIX_ERROR 20220307-142805.000429 STATE_TRANSITION 803412fe-7256-4e37-ac94-ef1fa200d202": {
      "AdditionalInfo": "Exception while executing a state transition task audit_event__15__19__20220305T2036Zjava.lang.reflect.InvocationTargetException\n\tat jdk.internal.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404)\n\tat org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331)\n\tat org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97)\n\tat org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: java.lang.RuntimeException: org.apache.pinot.spi.utils.retry.RetriableOperationException: java.lang.IllegalArgumentException: bound must be positive\n\tat org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.downloadSegmentFromPeer(RealtimeTableDataManager.java:498)\n\tat org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.downloadAndReplaceSegment(RealtimeTableDataManager.java:434)\n\tat org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:336)\n\tat org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:162)\n\tat org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:168)\n\t... 
11 more\nCaused by: org.apache.pinot.spi.utils.retry.RetriableOperationException: java.lang.IllegalArgumentException: bound must be positive\n\tat org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:58)\n\tat org.apache.pinot.common.utils.fetcher.BaseSegmentFetcher.fetchSegmentToLocal(BaseSegmentFetcher.java:91)\n\tat org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.downloadSegmentFromPeer(RealtimeTableDataManager.java:492)\n\t... 15 more\nCaused by: java.lang.IllegalArgumentException: bound must be positive\n\tat java.base/java.util.Random.nextInt(Random.java:388)\n\tat org.apache.pinot.common.utils.fetcher.BaseSegmentFetcher.lambda$fetchSegmentToLocal$1(BaseSegmentFetcher.java:92)\n\tat org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50)\n\t... 17 more\n",
      "Class": "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",
      "MSG_ID": "a9695cee-986c-4a8f-b1c8-59edfed00953",
      "Message state": "READ"
    },
    "HELIX_ERROR 20220307-142806.000618 STATE_TRANSITION 6dc67dfb-c952-4899-b46d-c4e7c2a829e6": {
      "AdditionalInfo": "Message execution failed. msgId: a9695cee-986c-4a8f-b1c8-59edfed00953, errorMsg: java.lang.reflect.InvocationTargetException",
      "Class": "class org.apache.helix.messaging.handling.HelixStateTransitionHandler",
      "MSG_ID": "a9695cee-986c-4a8f-b1c8-59edfed00953",
      "Message state": "READ"
    }
  },
  "listFields": {}
}
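The useful part of that record is buried at the end of the escaped stack trace. The innermost cause is `IllegalArgumentException: bound must be positive`, thrown by `java.util.Random.nextInt`, which rejects a bound that is not positive. The frames place it in the peer-download path (`downloadSegmentFromPeer` -> `fetchSegmentToLocal`). That would be consistent with an empty list of candidate download locations, though that reading is an inference from the trace, not something confirmed in this thread. Below is a rough sketch for pulling the innermost cause out of such a record. The `AdditionalInfo` text in the sample is abbreviated from the one above, and the helper name is made up.

```python
def root_causes(record):
    """Map each error key to the last 'Caused by:' line of its AdditionalInfo.

    Returns None for entries whose AdditionalInfo has no 'Caused by:' line.
    """
    result = {}
    for key, fields in record.get("mapFields", {}).items():
        info = fields.get("AdditionalInfo", "")
        causes = [line.strip() for line in info.splitlines()
                  if line.startswith("Caused by:")]
        result[key] = causes[-1] if causes else None
    return result


# Abbreviated sample: keys are from the record above, traces are shortened.
FIRST = "HELIX_ERROR 20220307-142805.000429 STATE_TRANSITION 803412fe-7256-4e37-ac94-ef1fa200d202"
SECOND = "HELIX_ERROR 20220307-142806.000618 STATE_TRANSITION 6dc67dfb-c952-4899-b46d-c4e7c2a829e6"
record = {
    "id": "1000859de3b0088__audit_event_REALTIME",
    "mapFields": {
        FIRST: {
            "AdditionalInfo": (
                "Exception while executing a state transition task "
                "audit_event__15__19__20220305T2036Z"
                "java.lang.reflect.InvocationTargetException\n"
                "Caused by: java.lang.RuntimeException: "
                "org.apache.pinot.spi.utils.retry.RetriableOperationException: "
                "java.lang.IllegalArgumentException: bound must be positive\n"
                "Caused by: java.lang.IllegalArgumentException: bound must be positive\n"
                "\tat java.base/java.util.Random.nextInt(Random.java:388)\n"
            )
        },
        SECOND: {
            "AdditionalInfo": "Message execution failed. msgId: "
                              "a9695cee-986c-4a8f-b1c8-59edfed00953, errorMsg: "
                              "java.lang.reflect.InvocationTargetException"
        },
    },
}

causes = root_causes(record)
print(causes[FIRST])
```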
Let me add more context... I was stress-testing the table to measure cluster resources for our use case. The servers ran out of heap but kept running; it's as if Helix swallowed the out-of-memory error.
That segment was left in a funny state where the segment URL in the metadata was blank and the file had a UUID suffix in deep storage. I renamed the file to remove the UUID and updated the segment metadata in ZK. Then I called the reset segment API.
I did the same for other segments and they were fine, but that didn't work for this segment
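For reference, the reset call from the previous message could be scripted roughly like this. This is a sketch under assumptions: the `POST /segments/{tableNameWithType}/{segmentName}/reset` path is what I'd expect on the controller, but check it against your controller's Swagger UI. The controller address `http://localhost:9000` is a placeholder, and the table name is inferred from the record id above. The request is only built here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_reset_request(controller_url, table_name_with_type, segment_name):
    """Build (but do not send) a POST request for the segment reset endpoint.

    The endpoint path is an assumption; verify it for your Pinot version.
    """
    path = "/segments/{}/{}/reset".format(
        quote(table_name_with_type, safe=""),
        quote(segment_name, safe=""),
    )
    return Request(controller_url.rstrip("/") + path, method="POST")


req = build_reset_request(
    "http://localhost:9000",                 # placeholder controller address
    "audit_event_REALTIME",                  # table name inferred from the record id
    "audit_event__15__19__20220305T2036Z",   # segment name from the stack trace
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually issue the call. Keeping the build step separate makes it easy to review the URL before touching the cluster.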