https://pinot.apache.org/ logo
b

Buchi Reddy

01/24/2021, 12:53 AM
Hey folks, a bunch of segments are in BAD state in Pinot and reloading all the segments from the UI didn’t help. Trying to debug it further but didn’t see any warnings or errors in the logs. Any hints on where to check? Any usual suspects?
m

Mayank

01/24/2021, 12:56 AM
Server log should definitely have the error upon loading
x

Xiang Fu

01/24/2021, 2:34 AM
can you try to reload one bad segment from the server and check the log from it ?
the log file pinotServer.log inside the container should tell
k

Ken Krugler

01/24/2021, 2:55 AM
When I wound up with bad segments due to OOME with building inverted indexes, the generated files were invalid. So doing a reload would fail due to “invalid metadata” errors from the segments.
b

Buchi Reddy

01/24/2021, 3:42 AM
How did you get out of that then @Ken Krugler?
thanks @Xiang Fu I didn’t check pinotServer.log
x

Xiang Fu

01/24/2021, 3:44 AM
Oh? What’s the exact metadata invalid exception? We should fix that
m

Mayank

01/24/2021, 3:45 AM
For @Ken Krugler the inv index generation at segment load failed I think -OOM due to large inv index size
b

Buchi Reddy

01/24/2021, 5:04 AM
Just curious, what does this destroy message mean?
Copy code
[pinot-server-2 pinot-server] 2021/01/24 05:04:23.421 INFO [ImmutableSegmentImpl] [HelixTaskExecutor-message_handle_thread] Trying to destroy segment : spanEventView__7__6062__20210122T1653Z
And, if there are some BAD segments for a table, would the consumption from the table continue to happen or not?
the state where we’re struck is, two tables have a bunch of BAD segments and the kafka consumer lag on those two topics is continuing to increase
m

Mayank

01/24/2021, 5:05 AM
Destroy message displayed when Closing an open segment
b

Buchi Reddy

01/24/2021, 5:08 AM
So, why is the segment being closed when reloaded? maybe i don’t understand the full details of it
m

Mayank

01/24/2021, 5:09 AM
Can you for the segment name to get all the log messages about it
b

Buchi Reddy

01/24/2021, 5:12 AM
Copy code
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.975 INFO [HelixTaskExecutor] [ZkClient-EventThread-22-zookeeper.traceable.svc.cluster.local:2181/pinot] Scheduling message 59d16855-2d7e-418c-9e59-ee842decb97b: spanEventView_REALTIME:spanEventView__4__6651__20210124T0507Z, null->null


[pinot-server-0 pinot-server] 2021/01/24 05:11:42.976 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Handling message: ZnRecord=59d16855-2d7e-418c-9e59-ee842decb97b, {CREATE_TIMESTAMP=1611465102654, EXECUTE_START_TIMESTAMP=1611465102975, MSG_ID=59d16855-2d7e-418c-9e59-ee842decb97b, MSG_STATE=new, MSG_SUBTYPE=RELOAD_SEGMENT, MSG_TYPE=USER_DEFINE_MSG, PARTITION_NAME=spanEventView__4__6651__20210124T0507Z, RESOURCE_NAME=spanEventView_REALTIME, RETRY_COUNT=0, SRC_CLUSTER=traceable-views, SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_pinot-controller-1.pinot-controller.traceable.svc.cluster.local_9000, TGT_NAME=Server_pinot-server-0.pinot-server.traceable.svc.cluster.local_8098, TGT_SESSION_ID=3760dff56d50350, TIMEOUT=-1}{}{}, Stat=Stat {_version=0, _creationTime=1611465102938, _modifiedTime=1611465102938, _ephemeralOwner=0}
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.976 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Waiting for lock to refresh : spanEventView__4__6651__20210124T0507Z, queue-length: 0
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.976 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Acquired lock to refresh segment: spanEventView__4__6651__20210124T0507Z (lock-time=0ms, queue-length=0)
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.976 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading single segment: spanEventView__4__6651__20210124T0507Z in table: spanEventView_REALTIME
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.982 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading segment: spanEventView__4__6651__20210124T0507Z in table: spanEventView_REALTIME
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.982 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Skip reloading REALTIME consuming segment: spanEventView__4__6651__20210124T0507Z in table: spanEventView_REALTIME
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.982 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloaded single segment: spanEventView__4__6651__20210124T0507Z in table: spanEventView_REALTIME
[pinot-server-0 pinot-server] 2021/01/24 05:11:42.985 INFO [HelixTask] [HelixTaskExecutor-message_handle_thread] Message: 59d16855-2d7e-418c-9e59-ee842decb97b (parent: null) handling task for spanEventView_REALTIME:spanEventView__4__6651__20210124T0507Z completed at: 1611465102985, results: true. FrameworkTime: 3 ms; HandlerTime: 7 ms.
[pinot-server-0 pinot-server] 2021/01/24 05:12:01.020 INFO [LLRealtimeSegmentDataManager_spanEventView__4__6651__20210124T0507Z] [spanEventView__4__6651__20210124T0507Z] Consumed 3382 events from (rate:54.638275/s), currentOffset=1019190068, numRowsConsumedSoFar=36479, numRowsIndexedSoFar=36479
Tried reload for this segment just now
I’ve noticed a discrepancy b/w the table page and individual segment page. At table level a segment is shown as BAD but when I drill into segment details page, it’s shown as ONLINE
oh wait. that above segment seems to be in consumption. let me reload one of the BAD segments
Copy code
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.728 INFO [HelixTaskExecutor] [ZkClient-EventThread-22-zookeeper.traceable.svc.cluster.local:2181/pinot] Scheduling message 2fe3a4e4-4bbe-4eaf-831a-684049da8276: spanEventView_REALTIME:spanEventView__0__9668__20210123T0445Z, null->null
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.729 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Handling message: ZnRecord=2fe3a4e4-4bbe-4eaf-831a-684049da8276, {CREATE_TIMESTAMP=1611465377490, EXECUTE_START_TIMESTAMP=1611465377729, MSG_ID=2fe3a4e4-4bbe-4eaf-831a-684049da8276, MSG_STATE=new, MSG_SUBTYPE=RELOAD_SEGMENT, MSG_TYPE=USER_DEFINE_MSG, PARTITION_NAME=spanEventView__0__9668__20210123T0445Z, RESOURCE_NAME=spanEventView_REALTIME, RETRY_COUNT=0, SRC_CLUSTER=traceable-views, SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_pinot-controller-1.pinot-controller.traceable.svc.cluster.local_9000, TGT_NAME=Server_pinot-server-3.pinot-server.traceable.svc.cluster.local_8098, TGT_SESSION_ID=1760dff56ce038b, TIMEOUT=-1}{}{}, Stat=Stat {_version=0, _creationTime=1611465377706, _modifiedTime=1611465377706, _ephemeralOwner=0}
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.729 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Waiting for lock to refresh : spanEventView__0__9668__20210123T0445Z, queue-length: 0
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.729 INFO [spanEventView_REALTIME-SegmentReloadMessageHandler] [HelixTaskExecutor-message_handle_thread] Acquired lock to refresh segment: spanEventView__0__9668__20210123T0445Z (lock-time=0ms, queue-length=0)
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.729 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading single segment: spanEventView__0__9668__20210123T0445Z in table: spanEventView_REALTIME

[pinot-server-3 pinot-server] 2021/01/24 05:16:17.735 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading segment: spanEventView__0__9668__20210123T0445Z in table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: tags__KEYS
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: service_id
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: tags__VALUES
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: api_id
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: customer_id
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [InvertedIndexHandler] [HelixTaskExecutor-message_handle_thread] Found inverted index for segment: spanEventView__0__9668__20210123T0445Z, column: api_boundary_type
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.907 INFO [RangeIndexHandler] [HelixTaskExecutor-message_handle_thread] Found range index for segment: spanEventView__0__9668__20210123T0445Z, column: start_time_millis
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [ImmutableSegmentLoader] [HelixTaskExecutor-message_handle_thread] Successfully loaded segment spanEventView__0__9668__20210123T0445Z with readMode: mmap
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [spanEventView_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Adding immutable segment: spanEventView__0__9668__20210123T0445Z to table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [spanEventView_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Replaced immutable segment: spanEventView__0__9668__20210123T0445Z of table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [spanEventView_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Closing segment: spanEventView__0__9668__20210123T0445Z of table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.920 INFO [ImmutableSegmentImpl] [HelixTaskExecutor-message_handle_thread] Trying to destroy segment : spanEventView__0__9668__20210123T0445Z
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.921 INFO [spanEventView_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Closed segment: spanEventView__0__9668__20210123T0445Z of table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.921 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloaded segment: spanEventView__0__9668__20210123T0445Z in table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.927 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloaded single segment: spanEventView__0__9668__20210123T0445Z in table: spanEventView_REALTIME
[pinot-server-3 pinot-server] 2021/01/24 05:16:17.931 INFO [HelixTask] [HelixTaskExecutor-message_handle_thread] Message: 2fe3a4e4-4bbe-4eaf-831a-684049da8276 (parent: null) handling task for spanEventView_REALTIME:spanEventView__0__9668__20210123T0445Z completed at: 1611465377931, results: true. FrameworkTime: 4 ms; HandlerTime: 198 ms.
Here you go @Mayank Thank you very much for the help
m

Mayank

01/24/2021, 5:19 AM
This log seems fine, The refresh is a close->reopen.
b

Buchi Reddy

01/24/2021, 5:20 AM
okay
BTW, is there a way to only reload all BAD segments? Don’t see an REST API or option in the UI
m

Mayank

01/24/2021, 5:20 AM
restart the server?
b

Buchi Reddy

01/24/2021, 5:21 AM
that’s reload of all tables and all segments right?
m

Mayank

01/24/2021, 5:21 AM
Dont recall if there is an api, but would be a good feature ask if there isn't one
oh there's multiple tables on this one server?
There used to be a disable table, if it still exists then try disable->enable of the table
b

Buchi Reddy

01/24/2021, 5:22 AM
yeah we have 7 tables and they’re spread across all pinot-servers
m

Mayank

01/24/2021, 5:22 AM
Interesting, you have multi-tenant realtime then.
b

Buchi Reddy

01/24/2021, 5:24 AM
multiple tables but all in default tenant and all realtime tables only
I thought this is the usual deployment. isn’t it?
BTW, what does table disable mean? we have a lot of segments so want to make sure disable isn’t very expensive or doesn’t lose data
m

Mayank

01/24/2021, 5:25 AM
You won't lose data, but reload of all segments may take a while, so may be not a good idea.
May be just write some script to loop over all bad segments
and reload
b

Buchi Reddy

01/24/2021, 5:26 AM
yeah was thinking about that but thought someone would have done it already 🙂 let me do it
k

Ken Krugler

01/24/2021, 4:16 PM
@Xiang Fu when I had a bad segment (due to OOM during generation of inverted index) and tried to reload it, I got:
Copy code
2021/01/22 23:44:55.035 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Reloading single segment: crawldata_OFFLINE_2018-12-01_2018-12-31_126 in table: crawldata_OFFLINE
2021/01/22 23:44:55.035 INFO [HelixInstanceDataManager] [HelixTaskExecutor-message_handle_thread] Segment metadata is null. Skip reloading segment: crawldata_OFFLINE_2018-12-01_2018-12-31_126 in table: crawldata_OFFLINE