# troubleshooting
l
Hi Team, I need help with the following: I have a few segments in ERROR state. Tracing the logs, the server tries, and fails, to download a segment (which is already present locally) from its peers instead of from deep store. Error log:
Caught exception while fetching segment from: <http://ip-10-110-217-232.ap-southeast-1.compute.internal:8097/segments/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z> to: /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z.tar.gz
This issue arose when I restarted all of my Pinot servers. The segments are present on both servers, so I'm not sure why it is trying to download from its peer. Thanks in advance 🙏
m
Have you enabled peer download?
l
Yes, I did. I also enabled deep storage.
m
It should say why it is trying to download in the log?
l
The server is trying to transition the segment from OFFLINE to ONLINE. It caught an exception while loading the segment, so it tried to download a new copy, and that download from deep store then hit another exception. Controller logs:
ERROR [MessageGenerationPhase] [HelixController-pipeline-default-stg-mimic-pinot-(e22acb6a_DEFAULT)] Event e22acb6a_DEFAULT : Unable to find a next state for resource: transportSurgeMirrorMetric_REALTIME partition: transportSurgeMirrorMetric__0__151__20221027T0840Z from stateModelDefinitionclass org.apache.helix.model.StateModelDefinition from:ERROR to:ONLINE
Broker logs:
Failed to find servers hosting segment: transportSurgeMirrorMetric__0__151__20221027T0840Z for table: transportSurgeMirrorMetric_REALTIME (all ONLINE/CONSUMING instances: [Server_ip-10-110-217-232.ap-southeast-1.compute.internal_8098] and OFFLINE instances: [] are disabled, counting segment as unavailable)
Server logs below:
2022/11/07 04:12:33.383 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_10] Instance Server_ip-10-110-222-230.ap-southeast-1.compute.internal_8098, partition transportSurgeMirrorMetric__0__151__20221027T0840Z received state transition from OFFLINE to ONLINE on session 2007a773b0d0022, message id: defaccf1-dd6a-4d89-9a1f-dcdb5340d2e5
2022/11/07 04:12:33.521 ERROR [transportSurgeMirrorMetric_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread_10] Caught exception while loading segment: transportSurgeMirrorMetric__0__151__20221027T0840Z, downloading a new copy
java.lang.RuntimeException: java.io.FileNotFoundException: /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z/v3/metadata.properties (Too many open files)
2022/11/07 04:12:33.537 INFO [S3PinotFS] [HelixTaskExecutor-message_handle_thread_10] Copy <s3://stg-pinot-archive/stg-mimic-pinot/controller-data/transportSurgeMirrorMetric/transportSurgeMirrorMetric__0__151__20221027T0840Z> to local /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z.tar.gz
2022/11/07 04:12:34.077 WARN [PinotFSSegmentFetcher] [HelixTaskExecutor-message_handle_thread_10] Caught exception while fetching segment from: <s3://stg-pinot-archive/stg-mimic-pinot/controller-data/transportSurgeMirrorMetric/transportSurgeMirrorMetric__0__151__20221027T0840Z> to: /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z.tar.gz
I was also wondering: how do I confirm that no data was lost? 🤔
m
What’s the exception?
l
Just wondering: if I am able to query despite having bad segments, does that mean all segments are available, or does Pinot just skip those bad segments (and so skip some data)? Context: replication is set to 2.
Hmm, it seems the server hit an exception (Too many open files), which is causing the issue. Maybe after the server restart it is trying to open a large number of segments on local disk.
2022/11/07 04:12:33.521 ERROR [transportSurgeMirrorMetric_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread_10] Caught exception while loading segment: transportSurgeMirrorMetric__0__151__20221027T0840Z, downloading a new copy
java.lang.RuntimeException: java.io.FileNotFoundException: /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z/v3/metadata.properties (Too many open files)
And is it unable to download from deep store or from the peer because the segment is already present locally and replacing the file is not allowed? Resetting the segment solves the issue: it brings the segment back to OFFLINE state and restarts the process of loading the segment from local disk, which now succeeds.
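For anyone scripting the reset mentioned above instead of doing it by hand, here is a minimal sketch. It assumes the controller exposes a segment reset endpoint at POST /segments/{tableNameWithType}/{segmentName}/reset; please verify the exact path in the controller's Swagger UI for your Pinot version. The helper name build_reset_request and the controller address in the usage note are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request


def build_reset_request(controller_url, table_name_with_type, segment_name):
    """Build (but do not send) a POST request that asks the controller to reset one segment.

    The endpoint path is an assumption; check it against your controller's API docs.
    """
    path = "/segments/{}/{}/reset".format(
        quote(table_name_with_type, safe=""),
        quote(segment_name, safe=""),
    )
    return Request(controller_url.rstrip("/") + path, method="POST")
```

To actually send it, pass the result to urllib.request.urlopen, for example with a controller at http://localhost:9000 (hypothetical), the table transportSurgeMirrorMetric_REALTIME, and the segment name from the logs.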
x
Then you need to increase the system limit: http://woshub.com/too-many-open-files-error-linux/
l
Thanks for replying @Xiang Fu even when you’re on holiday 😅 Yep, I understand, but I can’t find the justification for increasing the limit. I ran ulimit -n to check the open-file limit per process, which gives me 65536. I have about 3453 segments, and assuming all of the segment files are open [columns.psf, creation.meta, index_map, metadata.properties], that is around 14k open files, which is far from the limit. Not sure why it is hitting the limit.
cc: @Michael Roman Wengle
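The back-of-the-envelope estimate above, written out as a small sketch. It uses only the numbers from the message (3453 segments, four files per segment, a limit of 65536) and assumes every segment file is held open at once. It deliberately ignores other descriptors the process holds, such as sockets and log files, so the real count will be higher.

```python
def estimate_segment_fds(num_segments, files_per_segment):
    """Upper-bound estimate of descriptors used by segment files alone."""
    return num_segments * files_per_segment


SEGMENTS = 3453          # segment count quoted in the thread
FILES_PER_SEGMENT = 4    # columns.psf, creation.meta, index_map, metadata.properties
NOFILE_LIMIT = 65536     # value reported by `ulimit -n`

estimated = estimate_segment_fds(SEGMENTS, FILES_PER_SEGMENT)
print(f"estimated segment file handles: {estimated}")          # 13812
print(f"share of limit: {estimated / NOFILE_LIMIT:.0%}")       # 21%
```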
x
You need to probe a running environment until it fails to check it.
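One way to do that probing on Linux is to read the process's entries under /proc. The sketch below counts the descriptors a given PID currently holds and reads the soft limit that actually applies to that process; this can differ from what ulimit -n prints in an interactive shell, for example when the server is started by a service manager. The function names are hypothetical, and the proc_root parameter exists only to make the helpers easy to test.

```python
import os
from pathlib import Path


def count_open_fds(pid, proc_root="/proc"):
    """Number of file descriptors currently open in the process (Linux only)."""
    return len(os.listdir(Path(proc_root) / str(pid) / "fd"))


def soft_nofile_limit(pid, proc_root="/proc"):
    """Soft 'Max open files' limit of the process, or None if unlimited or not found."""
    with open(Path(proc_root) / str(pid) / "limits") as limits_file:
        for line in limits_file:
            if line.startswith("Max open files"):
                # Fields: "Max", "open", "files", <soft>, <hard>, <units>
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None
```

Calling both helpers with the Pinot server's PID every few seconds while the servers restart should show whether the count really approaches the limit, and which limit is in effect. Reading another user's /proc entries usually requires running as that user or as root.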