Hey folks, a bunch of segments are in BAD state in...
# troubleshooting
b
Hey folks, a bunch of segments are in BAD state in Pinot and reloading all the segments from the UI didn’t help. Trying to debug it further but didn’t see any warnings or errors in the logs. Any hints on where to check? Any usual suspects?
m
Server log should definitely have the error upon loading
x
can you try to reload one bad segment from the server and check the log from it ?
the log file pinotServer.log inside the container should tell
k
When I wound up with bad segments due to OOME with building inverted indexes, the generated files were invalid. So doing a reload would fail due to “invalid metadata” errors from the segments.
b
How did you get out of that then @Ken Krugler?
thanks @Xiang Fu I didn’t check pinotServer.log
x
Oh? What’s the exact metadata invalid exception? We should fix that
m
For @Ken Krugler the inv index generation at segment load failed I think -OOM due to large inv index size
b
Just curious, what does this destroy message mean?
Copy code
[pinot-server-2 pinot-server] 2021/01/24 05:04:23.421 INFO [ImmutableSegmentImpl] [HelixTaskExecutor-message_handle_thread] Trying to destroy segment : spanEventView__7__6062__20210122T1653Z
And, if there are some BAD segments for a table, would the consumption from the table continue to happen or not?
the state where we’re struck is, two tables have a bunch of BAD segments and the kafka consumer lag on those two topics is continuing to increase
m
Destroy message displayed when Closing an open segment
b
So, why is the segment being closed when reloaded? maybe i don’t understand the full details of it
m
Can you for the segment name to get all the log messages about it
b
Copy code
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.975 INFO [HelixTaskExecutor] [ZkClient-EventThread-22-zookeeper.traceable.svc.cluster.local:2181/pinot] Scheduling message 59d16855-2d7e-418c-9e59-ee842decb97b: spanEventView_REALTIME:spanEventView__4__6651__20210124T0507Z, null->null


[pinot-server-0 pinot-server] 2021/01/24 05:11:42.976 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Handling message: ZnRecord=59d16855-2d7e-418c-9e59-ee842decb97b, {CREATE_TIMESTAMP=1611465102654, EXECUTE_START_TIMESTAMP=1611465102975, MSG_ID=59d16855-2d7e-418c-9e59-ee842decb97b, MSG_STATE=new, MSG_SUBTYPE=RELOAD_SEGMENT, MSG_TYPE=USER_DEFINE_MSG, PARTITION_NAME=spanEventView__4__6651__20210124T0507Z, RESOURCE_NAME=spanEventView_REALTIME, RETRY_COUNT=0, SRC_CLUSTER=traceable-views, SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_pinot-controller-1.pinot-controller.traceable.svc.cluster.local_9000, TGT_NAME=Server_pinot-server-0.pinot-server.traceable.svc.cluster.local_8098, TGT_SESSION_ID=3760dff56d50350, TIMEOUT=-1}{}{}, Stat=Stat {_version=0, _creationTime=1611465102938, _modifiedTime=1611465102938, _ephemeralOwner=0}
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.976 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Waiting for lock to refresh : spanEventView__4__6651__20210124T0507Z, queue-length: 0
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.976 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Acquired lock to refresh segment: spanEventView__4__6651__20210124T0507Z (lock-time=0ms, queue-length=0)
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.976 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading single segment: spanEventView__4__6651__20210124T0507Z in table: spanEventView_REALTIME
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.982 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading segment: spanEventView__4__6651__20210124T0507Z in table: spanEventView_REALTIME
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.982 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Skip reloading REALTIME consuming segment: spanEventView__4__6651__20210124T0507Z in table: spanEventView_REALTIME
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.982 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloaded single segment: spanEventView__4__6651__20210124T0507Z in table: spanEventView_REALTIME
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.985 INFO [HelixTask] [HelixTaskExecutor-message_handle_thread] Message: 59d16855-2d7e-418c-9e59-ee842decb97b (parent: null) handling task for spanEventView_REALTIME:spanEventView__4__6651__20210124T0507Z completed at: 1611465102985, results: true. FrameworkTime: 3 ms; HandlerTime: 7 ms.
[pinot-server-0 pinot-server] 2021/01/24 05:12:01.020 INFO [LLRealtimeSegmentDataManager_spanEventView__4__6651__20210124T0507Z] [spanEventView__4__6651__20210124T0507Z] Consumed 3382 events from (rate:54.638275/s), currentOffset=1019190068, numRowsConsumedSoFar=36479, numRowsIndexedSoFar=36479
Tried reload for this segment just now
I’ve noticed a discrepancy b/w the table page and individual segment page. At table level a segment is shown as BAD but when I drill into segment details page, it’s shown as ONLINE
oh wait. that above segment seems to be in consumption. let me reload one of the BAD segments
Copy code
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.728 INFO [HelixTaskExecutor] [ZkClient-EventThread-22-zookeeper.traceable.svc.cluster.local:2181/pinot] Scheduling message 2fe3a4e4-4bbe-4eaf-831a-684049da8276: spanEventView_REALTIME:spanEventView__0__9668__20210123T0445Z, null->null
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.729 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Handling message: ZnRecord=2fe3a4e4-4bbe-4eaf-831a-684049da8276, {CREATE_TIMESTAMP=1611465377490, EXECUTE_START_TIMESTAMP=1611465377729, MSG_ID=2fe3a4e4-4bbe-4eaf-831a-684049da8276, MSG_STATE=new, MSG_SUBTYPE=RELOAD_SEGMENT, MSG_TYPE=USER_DEFINE_MSG, PARTITION_NAME=spanEventView__0__9668__20210123T0445Z, RESOURCE_NAME=spanEventView_REALTIME, RETRY_COUNT=0, SRC_CLUSTER=traceable-views, SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_pinot-controller-1.pinot-controller.traceable.svc.cluster.local_9000, TGT_NAME=Server_pinot-server-3.pinot-server.traceable.svc.cluster.local_8098, TGT_SESSION_ID=1760dff56ce038b, TIMEOUT=-1}{}{}, Stat=Stat {_version=0, _creationTime=1611465377706, _modifiedTime=1611465377706, _ephemeralOwner=0}
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.729 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Waiting for lock to refresh : spanEventView__0__9668__20210123T0445Z, queue-length: 0
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.729 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Acquired lock to refresh segment: spanEventView__0__9668__20210123T0445Z (lock-time=0ms, queue-length=0)
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.729 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading single segment: spanEventView__0__9668__20210123T0445Z in table: spanEventView_REALTIME

[pinot-server-3 pinot-server] 2021/01/24 05:16:17.735 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading segment: spanEventView__0__9668__20210123T0445Z in table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: tags__KEYS
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: service_id
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: tags__VALUES
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: api_id
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: customer_id
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: api_boundary_type
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [RangeIndexHandler] [HelixTaskExecutor-message_handle_thread] Found range index for segment: spanEventView__0__9668__20210123T0445Z, column: start_time_millis
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [ImmutableSegmentLoader] [HelixTaskExecutor-message_handle_thread] Successfully loaded segment spanEventView__0__9668__20210123T0445Z with readMode: mmap
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [spanEventView_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Adding immutable segment: spanEventView__0__9668__20210123T0445Z to table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [spanEventView_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Replaced immutable segment: spanEventView__0__9668__20210123T0445Z of table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [spanEventView_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Closing segment: spanEventView__0__9668__20210123T0445Z of table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [ImmutableSegmentImpl] [HelixTaskExecutor-message_handle_thread] Trying to destroy segment : spanEventView__0__9668__20210123T0445Z
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.921 INFO [spanEventView_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Closed segment: spanEventView__0__9668__20210123T0445Z of table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.921 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloaded segment: spanEventView__0__9668__20210123T0445Z in table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.927 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloaded single segment: spanEventView__0__9668__20210123T0445Z in table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.931 INFO [HelixTask] [HelixTaskExecutor-message_handle_thread] Message: 2fe3a4e4-4bbe-4eaf-831a-684049da8276 (parent: null) handling task for spanEventView_REALTIME:spanEventView__0__9668__20210123T0445Z completed at: 1611465377931, results: true. FrameworkTime: 4 ms; HandlerTime: 198 ms.
Here you go @Mayank Thank you very much for the help
m
This log seems fine, The refresh is a close->reopen.
b
okay
BTW, is there a way to only reload all BAD segments? Don’t see an REST API or option in the UI
m
restart the server?
b
that’s reload of all tables and all segments right?
m
Dont recall if there is an api, but would be a good feature ask if there isn't one
oh there's multiple tables on this one server?
There used to be a disable table, if it still exists then try disable->enable of the table
b
yeah we have 7 tables and they’re spread across all pinot-servers
m
Interesting, you have multi-tenant realtime then.
b
multiple tables but all in default tenant and all realtime tables only
I thought this is the usual deployment. isn’t it?
BTW, what does table disable mean? we have a lot of segments so want to make sure disable isn’t very expensive or doesn’t lose data
m
You won't lose data, but reload of all segments may take a while, so may be not a good idea.
May be just write some script to loop over all bad segments
and reload
b
yeah was thinking about that but thought someone would have done it already 🙂 let me do it
k
@Xiang Fu when I had a bad segment (due to OOM during generation of inverted index) and tried to reload it, I got:
Copy code
2021/01/22 23:44:55.035 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading single segment: crawldata_OFFLINE_2018-12-01_2018-12-31_126 in table: crawldata_OFFLINE
2021/01/22 23:44:55.035 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Segment metadata is null. Skip reloading segment: crawldata_OFFLINE_2018-12-01_2018-12-31_126 in table: crawldata_OFFLINE