# troubleshooting
Hello guys, similar to this thread, the segments I ingested are showing as BAD. Checking the swagger debug endpoint, the error is:
Caught exception in state transition from OFFLINE -> ONLINE for resource: adv1_OFFLINE, partition: adv1_OFFLINE_2022-03-01_2022-03-01_0
Could this be related to the data itself, or to OOMs/resources like Mayank mentioned in the thread? Any suggestions on how to debug it? Thanks
You’ll see the full stack trace and reason in the pinot-server logs. This is usually resource related; another common cause is not being able to access the deep store.
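(As an aside, a minimal way to pull the surrounding stack trace out of a server log with `grep -A`. The log contents and path here are mocked for illustration; point it at your actual pinot-server log file.)

```shell
# Mock log file for demonstration; real logs live wherever your
# deployment writes pinot-server output (path is an assumption).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2022/04/12 12:07:03 ERROR [HelixStateTransitionHandler] Exception while executing a state transition task
java.lang.reflect.InvocationTargetException: null
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
EOF

# Print each ERROR line plus the lines after it, so the
# "Caused by" chain (the actual reason) is visible.
grep -A 5 'ERROR' "$LOG"
```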
@User perhaps we need a doc page now, on debugging BAD segments? I dunno if you already have one
yes, based on the stack trace it seems like it's unable to download the segment. In this case, I'm using local storage (deep store is not set up), and it seems like it's trying to get the segment from deep store:
@User we don't have one, but yeh we should have one. But you'll have to help me with the content as I don't know how to debug/solve bad segments!
can you share the entire stack trace?
Yes, this is the response from
isn’t there a longer form of this stack trace in the pinot-server logs, that has “Caused by” section?
i get this issue a lot when i’m moving pods around/restarting and whatnot
Yes, this is the log from the server:
that’s strange. does it clear if you call resetSegment or restart the server?
can you check the controller conf and server conf to make sure there are no extra configs about FS?
```
root@pinot-server-0:/var/pinot/server/config# cat pinot-server.conf
```
```
root@pinot-controller-0:/opt/pinot# cat /var/pinot/controller/config/pinot-controller.conf
```
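(A quick way to scan those conf files for anything that points at a deep store is to grep for filesystem/fetcher keys. The conf below is a mock; `pinot.server.instance.dataDir` is a standard Pinot server key, but treat the exact key list as an example, not exhaustive.)

```shell
# Mock server conf for demonstration; in the thread the real file is
# /var/pinot/server/config/pinot-server.conf.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
pinot.server.instance.dataDir=/var/pinot/server/data/index
pinot.server.instance.segmentTarDir=/var/pinot/server/data/segment
EOF

# Keys mentioning "storage.factory" or "segment.fetcher" would indicate
# a deep-store / PinotFS override; dataDir shows where segments land locally.
grep -Ei 'storage\.factory|segment\.fetcher|dataDir|data\.dir' "$CONF"
```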
Tried reset segment? Also I've generally noticed in such an exception, there's a warn several log lines before. Anything before the exception?
Hello Neha, I just tried resetting the segments:
```
$ curl -X POST "http://localhost:9001/segments/adv1_OFFLINE/reset" -H "accept: application/json"
  "status": "Successfully reset all segments of table: adv1_OFFLINE"
```
then in server logs I found similar stack trace:
```
2022/04/12 12:07:03.473 ERROR [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread] Exception while executing a state transition task adv1_OFFLINE_2022-03-01_2022-03-01_1
java.lang.reflect.InvocationTargetException: null
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd
    at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 3 attempts
    at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:61) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.common.utils.fetcher.BaseSegmentFetcher.fetchSegmentToLocal(BaseSegmentFetcher.java:72) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a65
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchSegmentToLocalInternal(SegmentFetcherFactory.java:148) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchSegmentToLocal(SegmentFetcherFactory.java:142) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchAndDecryptSegmentToLocalInternal(SegmentFetcherFactory.java:164) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfe
    at org.apache.pinot.common.utils.fetcher.SegmentFetcherFactory.fetchAndDecryptSegmentToLocal(SegmentFetcherFactory.java:158) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88a
    at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadAndDecrypt(BaseTableDataManager.java:406) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a6
    at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadSegmentFromDeepStore(BaseTableDataManager.java:393) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f6
    at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadSegment(BaseTableDataManager.java:385) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a6506
    at org.apache.pinot.core.data.manager.BaseTableDataManager.addOrReplaceSegment(BaseTableDataManager.java:372) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a
    at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOrReplaceSegment(HelixInstanceDataManager.java:355) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9
    at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:162) ~[pinot-all
    ... 12 more
```
Okay.. can you share the entire controller and server log file? I'll see if there are any clues in the rest of the logs
I tried with different input files, and now there is a mix of bad and good segments. Could it be something related to the input data, like the date format or JSON format?
it can definitely be related to JSON format. Usually indexes are built on the server after downloading, so bad data can make loading on the server fail. But from the exception it sounds more like a problem during fetch. Is it possible that the segment was deleted from the source before it was downloaded? the full log would really help here, without that it’s just going to be guesswork.. 😕 even if you can grep for all ERROR/WARN and Exception
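(The grep suggested above can be done as a quick triage pass: count the ERROR/Exception and WARN lines to see where to look. The log contents below are mocked for illustration.)

```shell
# Mock log for demonstration; run the greps against your real server log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2022/04/13 18:40:17 WARN  [SegmentFetcher] retrying download
2022/04/13 18:40:17 ERROR [SegmentOnlineOfflineStateModel] Caught exception in state transition
java.io.IOException: No space left on device
EOF

grep -Ec 'ERROR|Exception' "$LOG"   # error/exception lines → 2
grep -c  'WARN' "$LOG"              # warnings often appear just before the failure → 1
```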
Hey Neha, running all the steps from scratch and saving the logs, I found this:
```
2022/04/13 18:40:17.304 ERROR [SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread] Caught exception in state transition from OFFLINE -> ONL
java.io.IOException: No space left on device
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62) ~[?:?]
```
So, after increasing the server pod's disk size, I was able to ingest 167 segments. Some of them were bad, but after removing them the table was ready for queries. Thanks for the help!
oh i’m glad. curious though, was this exception not seen anywhere in the previous logs? i was hoping we’d see something like this very obviously
Yes, that log also appears when I tried to reset the segments
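(Given the root cause turned out to be `No space left on device`, a cheap preflight check is to look at free space on the volume holding segment data before re-ingesting. The data dir below is an assumption; use whatever `dataDir` your server conf sets.)

```shell
# DATA_DIR is a placeholder; substitute your server's segment data dir.
DATA_DIR=${DATA_DIR:-/tmp}

# POSIX df: second line holds Used/Available/Capacity for the volume.
df -P "$DATA_DIR" | awk 'NR==2 {print "use%=" $5 ", avail(KB)=" $4}'
```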