# troubleshooting
Hello guys, similar to this thread, the segments I ingested are showing as BAD. Checking the swagger debug endpoint, the error is:
Caught exception in state transition from OFFLINE -> ONLINE for resource: adv1_OFFLINE, partition: adv1_OFFLINE_2022-03-01_2022-03-01_0
Could this be related to the data itself, or to OOMs/resources like Mayank mentioned in the thread? Any suggestions on how to debug it? Thanks
You’ll see the full stack trace and reason in the pinot-server logs. This is usually resource related; another common cause is not being able to access the deep store.
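(As an aside, a minimal way to pull the surrounding stack trace out of a server log with `grep -A`. The log contents and path here are mocked for illustration; point it at your actual pinot-server log file.)

```shell
# Mock log file for demonstration; real logs live wherever your
# deployment writes pinot-server output (path is an assumption).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2022/04/12 12:07:03 ERROR [HelixStateTransitionHandler] Exception while executing a state transition task
java.lang.reflect.InvocationTargetException: null
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
EOF

# Print each ERROR line plus the lines after it, so the
# "Caused by" chain (the actual reason) is visible.
grep -A 5 'ERROR' "$LOG"
```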
@User perhaps we need a doc page now, on debugging BAD segments? I dunno if you already have one
yes, based on the stack trace it seems like it's unable to download the segment. In this case, I'm using local storage (deep store is not set up), and it seems like it's trying to get the segment from deep store:
@User we don't have one, but yeh we should have one. But you'll have to help me with the content as I don't know how to debug/solve bad segments!
can you share the entire stack trace?
Yes, this is the response from
isn’t there a longer form of this stack trace in the pinot-server logs, that has “Caused by” section?
i get this issue a lot when i’m moving pods around/restarting and whatnot
Yes, this is the log from the server:
that’s strange. does it clear if you call resetSegment or restart the server?
can you check the controller conf and server conf to make sure there are no extra configs about FS?
```
root@pinot-server-0:/var/pinot/server/config# cat pinot-server.conf
```
```
root@pinot-controller-0:/opt/pinot# cat /var/pinot/controller/config/pinot-controller.conf
```
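(A quick way to scan those conf files for anything that points at a deep store is to grep for filesystem/fetcher keys. The conf below is a mock; `pinot.server.instance.dataDir` is a standard Pinot server key, but treat the exact key list as an example, not exhaustive.)

```shell
# Mock server conf for demonstration; in the thread the real file is
# /var/pinot/server/config/pinot-server.conf.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
pinot.server.instance.dataDir=/var/pinot/server/data/index
pinot.server.instance.segmentTarDir=/var/pinot/server/data/segment
EOF

# Keys mentioning "storage.factory" or "segment.fetcher" would indicate
# a deep-store / PinotFS override; dataDir shows where segments land locally.
grep -Ei 'storage\.factory|segment\.fetcher|dataDir|data\.dir' "$CONF"
```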
Tried reset segment? Also I've generally noticed in such an exception, there's a warn several log lines before. Anything before the exception?
Hello Neha, I just tried resetting the segments:
```
$ curl -X POST "http://localhost:9001/segments/adv1_OFFLINE/reset" -H "accept: application/json"
  "status": "Successfully reset all segments of table: adv1_OFFLINE"
```
then in server logs I found similar stack trace:
```
2022/04/12 12:07:03.473 ERROR [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread] Exception while executing a state transition task adv1_OFFLINE_2022-03-01_2022-03-01_1
java.lang.reflect.InvocationTargetException: null
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:61) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.common.utils.fetcher.BaseSegmentFetcher.fetchSegmentToLocal(BaseSegmentFetcher.java:72) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a65
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchSegmentToLocalInternal(SegmentFetcherFactory.java:148) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchSegmentToLocal(SegmentFetcherFactory.java:142) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchAndDecryptSegmentToLocalInternal(SegmentFetcherFactory.java:164) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfe
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchAndDecryptSegmentToLocal(SegmentFetcherFactory.java:158) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88a
    at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadAndDecrypt(BaseTableDataManager.java:406) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a6
    at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadSegmentFromDeepStore(BaseTableDataManager.java:393) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f6
    at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadSegment(BaseTableDataManager.java:385) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a6506
    at org.apache.pinot.core.data.manager.BaseTableDataManager.addOrReplaceSegment(BaseTableDataManager.java:372) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a
    at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOrReplaceSegment(HelixInstanceDataManager.java:355) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:162) ~[pinot-all
    ... 12 more
```
Okay.. can you share the entire controller and server log file? I'll see if there are any clues in the rest of the logs
I tried with different input files, and now there is a mix of bad and good segments. Could it be something related to the input data, like the date format or JSON format?
it can definitely be related to JSON format. Usually indexes are built on the server after downloading, so bad data can make loading on the server fail. But from the exception it sounds more like a problem during fetch. Is it possible that the segment was deleted from the source before it was downloaded? the full log would really help here, without that it’s just going to be guesswork.. 😕 even if you can grep for all ERROR/WARN and Exception
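(The grep suggested above can be done as a quick triage pass: count the ERROR/Exception and WARN lines to see where to look. The log contents below are mocked for illustration.)

```shell
# Mock log for demonstration; run the greps against your real server log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2022/04/13 18:40:17 WARN  [SegmentFetcher] retrying download
2022/04/13 18:40:17 ERROR [SegmentOnlineOfflineStateModel] Caught exception in state transition
java.io.IOException: No space left on device
EOF

grep -Ec 'ERROR|Exception' "$LOG"   # error/exception lines → 2
grep -c  'WARN' "$LOG"              # warnings often appear just before the failure → 1
```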
Hey Neha, running all the steps from scratch and saving the logs, I found this:
```
2022/04/13 18:40:17.304 ERROR [SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread] Caught exception in state transition from OFFLINE -> ONL
java.io.IOException: No space left on device
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62) ~[?:?]
```
So, after increasing the server pod's disk size, I was able to ingest 167 segments. Some of them were bad, but after removing them the table was ready for queries. Thanks for the help!
oh i’m glad. curious though, was this exception not seen anywhere in the previous logs? i was hoping we’d see something like this very obviously
Yes, that log also appears when I tried to reset the segments
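(Given the root cause turned out to be `No space left on device`, a cheap preflight check is to look at free space on the volume holding segment data before re-ingesting. The data dir below is an assumption; use whatever `dataDir` your server conf sets.)

```shell
# DATA_DIR is a placeholder; substitute your server's segment data dir.
DATA_DIR=${DATA_DIR:-/tmp}

# POSIX df: second line holds Used/Available/Capacity for the volume.
df -P "$DATA_DIR" | awk 'NR==2 {print "use%=" $5 ", avail(KB)=" $4}'
```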