Apache Pinot #troubleshooting

Lee Wei Hern Jason

11/07/2022, 7:08 AM

Hi Team, I need help with the following: I have a few segments in ERROR state. Tracing the logs, the server tried and failed to download the segment (which is present locally) from its peers instead of from deep store. Error log:

Caught exception while fetching segment from: <http://ip-10-110-217-232.ap-southeast-1.compute.internal:8097/segments/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z> to: /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z.tar.gz

This issue arose when I restarted all my Pinot servers. The segments are present on both servers, so I'm not sure why it is trying to download them from a peer. Thanks in advance 🙏

Mayank

11/07/2022, 2:13 PM

Have you enabled peer download?

Lee Wei Hern Jason

11/07/2022, 3:12 PM

Yes, I did. I also enabled deep storage.

Mayank

11/07/2022, 3:20 PM

The log should say why it is trying to download.

Lee Wei Hern Jason

11/07/2022, 3:45 PM

The server is trying to transition the segment from OFFLINE to ONLINE. It caught an exception while loading the segment, so it tried to download a new copy, and the download from deep store then failed with another exception. Controller logs:

ERROR [MessageGenerationPhase] [HelixController-pipeline-default-stg-mimic-pinot-(e22acb6a_DEFAULT)] Event e22acb6a_DEFAULT : Unable to find a next state for resource: transportSurgeMirrorMetric_REALTIME partition: transportSurgeMirrorMetric__0__151__20221027T0840Z from stateModelDefinitionclass org.apache.helix.model.StateModelDefinition from:ERROR to:ONLINE

Broker logs:

Failed to find servers hosting segment: transportSurgeMirrorMetric__0__151__20221027T0840Z for table: transportSurgeMirrorMetric_REALTIME (all ONLINE/CONSUMING instances: [Server_ip-10-110-217-232.ap-southeast-1.compute.internal_8098] and OFFLINE instances: [] are disabled, counting segment as unavailable)

Server logs below:

2022/11/07 04:12:33.383 INFO [HelixStateTransitionHandler] [HelixTaskExecutor-message_handle_thread_10] Instance Server_ip-10-110-222-230.ap-southeast-1.compute.internal_8098, partition transportSurgeMirrorMetric__0__151__20221027T0840Z received state transition from OFFLINE to ONLINE on session 2007a773b0d0022, message id: defaccf1-dd6a-4d89-9a1f-dcdb5340d2e5
2022/11/07 04:12:33.521 ERROR [transportSurgeMirrorMetric_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread_10] Caught exception while loading segment: transportSurgeMirrorMetric__0__151__20221027T0840Z, downloading a new copy
java.lang.RuntimeException: java.io.FileNotFoundException: /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z/v3/metadata.properties (Too many open files)
2022/11/07 04:12:33.537 INFO [S3PinotFS] [HelixTaskExecutor-message_handle_thread_10] Copy <s3://stg-pinot-archive/stg-mimic-pinot/controller-data/transportSurgeMirrorMetric/transportSurgeMirrorMetric__0__151__20221027T0840Z> to local /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z.tar.gz
2022/11/07 04:12:34.077 WARN [PinotFSSegmentFetcher] [HelixTaskExecutor-message_handle_thread_10] Caught exception while fetching segment from: <s3://stg-pinot-archive/stg-mimic-pinot/controller-data/transportSurgeMirrorMetric/transportSurgeMirrorMetric__0__151__20221027T0840Z> to: /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z.tar.gz

Lee Wei Hern Jason

11/08/2022, 7:57 AM

I was also wondering: how do I confirm that no data was lost? 🤔

Mayank

11/08/2022, 7:16 PM

What’s the exception?

Lee Wei Hern Jason

11/09/2022, 12:13 AM

Just wondering: if I am able to query despite having bad segments, does that mean all segments are available, or does Pinot just skip the bad segments (and so skip some data)? Context: replication is set to 2.

Lee Wei Hern Jason

11/09/2022, 9:43 AM

Hmm, it seems the server hit an exception (Too many open files), which is causing the issue. Maybe after the server restart, it is trying to open a large number of segments on local disk.

2022/11/07 04:12:33.521 ERROR [transportSurgeMirrorMetric_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread_10] Caught exception while loading segment: transportSurgeMirrorMetric__0__151__20221027T0840Z, downloading a new copy
java.lang.RuntimeException: java.io.FileNotFoundException: /mnt/data/pinot/index/transportSurgeMirrorMetric_REALTIME/transportSurgeMirrorMetric__0__151__20221027T0840Z/v3/metadata.properties (Too many open files)

And is it unable to download from deep store or from a peer because the segment is already present locally and replacing the file is not allowed? Resetting the segment solves the issue: it brings the segment back to the OFFLINE state and restarts the process of finding the segment on local disk, which now succeeds.
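
For reference, a minimal sketch of what resetting one segment through the controller REST API could look like. It assumes the controller exposes POST /segments/{tableNameWithType}/{segmentName}/reset (worth confirming in the controller's Swagger UI for your Pinot version). The controller address http://localhost:9000 is a placeholder, not something taken from this thread; the table and segment names are the ones from the logs above.

```python
import urllib.parse
import urllib.request


def build_reset_url(controller: str, table_with_type: str, segment: str) -> str:
    """Build the (assumed) controller URL for resetting a single segment."""
    table = urllib.parse.quote(table_with_type, safe="")
    seg = urllib.parse.quote(segment, safe="")
    return f"{controller.rstrip('/')}/segments/{table}/{seg}/reset"


def reset_segment(controller: str, table_with_type: str, segment: str) -> int:
    """Send the reset request and return the HTTP status code."""
    request = urllib.request.Request(
        build_reset_url(controller, table_with_type, segment),
        data=b"",
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder controller address (assumption); only prints the URL.
    print(
        build_reset_url(
            "http://localhost:9000",
            "transportSurgeMirrorMetric_REALTIME",
            "transportSurgeMirrorMetric__0__151__20221027T0840Z",
        )
    )
```

Running the script only prints the URL; calling reset_segment with a real controller address is what would actually send the request.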

Xiang Fu

11/10/2022, 8:32 AM

Then you need to increase the system limit: http://woshub.com/too-many-open-files-error-linux/

Lee Wei Hern Jason

11/10/2022, 9:07 AM

Thanks for replying @Xiang Fu even when you’re on holiday 😅 Yep, I understand, but I can’t find a justification for increasing the limit. I ran ulimit -n to check the per-process limit on open files, which gives me [value not preserved]. I have about 3453 segments, and assuming that all of the files in each segment are open [columns.psf, creation.meta, index_map, metadata.properties], that is around 14k open files, which is far from the limit. Not sure why it is hitting the limit.
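
The estimate above can be written out as a small sketch. It also parses the "Max open files" row of /proc/<pid>/limits, because ulimit -n run in a shell reports that shell's limit, which may differ from the limit applied to the already-running server process (for example when the server is started by a service manager). The sample limits text in the code is made up for illustration, and the estimate counts only segment files; sockets, log files, and other descriptors count against the same limit.

```python
# Files assumed open per segment, as listed in the message above:
# columns.psf, creation.meta, index_map, metadata.properties
FILES_PER_SEGMENT = 4


def estimate_segment_files(num_segments: int, files_per_segment: int = FILES_PER_SEGMENT) -> int:
    """Upper-bound estimate of descriptors used by segment files alone."""
    return num_segments * files_per_segment


def parse_max_open_files(limits_text: str) -> tuple:
    """Return (soft, hard) from the text of /proc/<pid>/limits as strings.

    Values are kept as strings because a limit can be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' row found")


def read_process_limits(pid: int) -> tuple:
    """Read the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_open_files(f.read())


# Illustrative sample only; not the values from the poster's servers.
SAMPLE_LIMITS = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 4096                 files\n"
)

if __name__ == "__main__":
    print(estimate_segment_files(3453))
    print(parse_max_open_files(SAMPLE_LIMITS))
```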

Lee Wei Hern Jason

11/10/2022, 9:41 AM

cc: @Michael Roman Wengle

Xiang Fu

11/15/2022, 12:41 PM

You need to probe a running environment up to the point where it fails to check this.
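
One way to do that on Linux is to sample the number of entries under /proc/<pid>/fd for the server process while it restarts and loads segments. A minimal sketch, assuming Linux and permission to read the process's /proc entry; the PID argument, sample count, and interval are placeholders to adjust.

```python
import os
import sys
import time


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count the file descriptors currently open in process `pid`."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def sample_fd_counts(pid: int, samples: int = 3, interval_s: float = 1.0,
                     proc_root: str = "/proc") -> list:
    """Take `samples` readings of the descriptor count, `interval_s` apart."""
    counts = []
    for i in range(samples):
        counts.append(count_open_fds(pid, proc_root))
        if i < samples - 1:
            time.sleep(interval_s)
    return counts


if __name__ == "__main__":
    # Pass the server's PID as the first argument; defaults to this script's own PID.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for n in sample_fd_counts(pid):
        print(f"pid={pid} open_fds={n}")
```

Comparing the sampled counts against the process's open-file limit during a restart would show whether the count actually approaches the limit and how quickly.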
