
Elon

02/10/2021, 1:45 AM
Apologies for all the trouble today: we noticed that some tables are in a "bad" state (cluster manager UI). It looks like it's due to an attempt by servers to download non-existent segments from the deepstore. Could it be that the segments were empty and never copied to the deepstore?
Should I just delete the segments to restore the idealState to good? Or could this be an issue with SegmentDeletionManager?
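For reference, a sketch of the segment-deletion API this question refers to, assuming a hypothetical controller host and table/segment names (the real names are redacted in this thread; verify the exact path against your version's Swagger UI):

```shell
# Hypothetical controller host and names for illustration only.
CONTROLLER="http://localhost:9000"
TABLE="myTable"
SEGMENT="myTable__0__96__20201216T0642Z"

# Delete one segment: it is removed from the ideal state, and the
# SegmentDeletionManager then moves its data to a deleted-segments area.
curl -s -X DELETE "${CONTROLLER}/segments/${TABLE}/${SEGMENT}"
```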
Also getting messages like this:
2021/02/10 01:51:21.689 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 5 segments with no online replicas
2021/02/10 01:51:21.689 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 0 replicas, below replication threshold :3
2021/02/10 01:51:21.796 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 1 replicas, below replication threshold :3
2021/02/10 01:51:21.815 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 2 replicas, below replication threshold :3
2021/02/10 01:51:21.877 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 1 replicas, below replication threshold :3
Segments appear to be on a server, but the replication factor is below the desired value and the segments are not in the deepstore. Is there any way to get the replication factor back to 3 and save whatever is missing from the deepstore to the deepstore?
I know what happened: I mistakenly scaled down the server statefulset, and the tenants these tables were on were down for days. The newer segments are OK. Should I delete the segments marked "BAD"?
I have a copy in gcs - is there a way I can move them to the deepstore directory to download?
I see that the segments exist on servers but still get "bad" for the segment status in cluster manager:
I copied a segment from the server to deepstore (tgz'd) and tried the reloadSegment api but it got a failure message
2021/02/10 00:52:18.810 WARN [integrations_operation_store_failure_stat_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Failed to download segment integrations_operation_store_failure_stat__0__96__20201216T0642Z from deep store:
tldr: I see that replicas per partition == 3 and the ideal state appears to be good for newer segments, but the cluster manager page (controller UI) shows the segment in a "bad" state, and the servers it's listed on (gif above) are not accurate
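A sketch of how one might inspect the ideal-state vs. actual-state mismatch described here, assuming a controller at a hypothetical host and a hypothetical table name:

```shell
# Hypothetical controller host and table name; real names are redacted above.
CONTROLLER="http://localhost:9000"
TABLE="myTable_REALTIME"

# Ideal state: which servers *should* host each segment.
curl -s "${CONTROLLER}/tables/${TABLE}/idealstate"

# External view: which servers *actually* report each segment ONLINE.
# A segment shows as "bad" in the UI when these two views disagree.
curl -s "${CONTROLLER}/tables/${TABLE}/externalview"
```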

Xiang Fu

02/10/2021, 4:30 AM
hmm, were those segments missing from the beginning, or were they deleted before they were purged?

Elon

02/10/2021, 4:31 AM
Not sure, but I was able to copy the segment from a server it was on (only 1) to GCS; then I upgraded to pinot 6 and that segment was suddenly visible
Trying to copy the rest to GCS - but not sure how to reload. Do I just run a rebalance on the table?

Xiang Fu

02/10/2021, 4:32 AM
or just force reload the table
maybe disable then enable the table
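The two suggestions above can be sketched with the controller REST API, again assuming hypothetical host and table names (in releases of this era the state toggle was a GET request, as confirmed below; newer releases use a POST to `/tables/{table}/state` instead):

```shell
# Hypothetical controller host and table name.
CONTROLLER="http://localhost:9000"
TABLE="myTable"

# Disable, then re-enable, the realtime table (GET-based state toggle).
curl -s "${CONTROLLER}/tables/${TABLE}?state=disable&type=realtime"
curl -s "${CONTROLLER}/tables/${TABLE}?state=enable&type=realtime"

# Force a reload of all segments in the table.
curl -s -X POST "${CONTROLLER}/segments/${TABLE}/reload?type=realtime"
```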

Elon

02/10/2021, 4:32 AM
Is that the "reload all segments" api?
for force reloading?
also, how to disable and enable the table?
Ah, the enable/disable is the GET request, right?

Xiang Fu

02/10/2021, 4:34 AM
you can check the swagger API
yes

Elon

02/10/2021, 4:35 AM
thanks a lot!
Trying this now
This worked @Xiang Fu - I tried reloading an individual segment and it didn't work, but when I restarted all the servers they all came online
All good!
So I found the server each segment was on, tgz'd it, copied it to GCS (from the server pod), then once done restarted all the servers
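The manual recovery described here might look roughly like this, assuming a Kubernetes deployment and hypothetical pod, path, bucket, and segment names:

```shell
# Hypothetical names for illustration; adjust to your deployment.
POD="pinot-server-0"
SEGMENT="myTable__0__96__20201216T0642Z"
DATA_DIR="/var/pinot/server/data/index/myTable_REALTIME"
BUCKET="gs://my-pinot-deepstore/myTable"

# 1. Tar the segment directory inside the server pod.
kubectl exec "$POD" -- tar -czf "/tmp/${SEGMENT}.tar.gz" -C "$DATA_DIR" "$SEGMENT"

# 2. Copy it out of the pod and up to GCS (the deepstore bucket).
kubectl cp "${POD}:/tmp/${SEGMENT}.tar.gz" "./${SEGMENT}.tar.gz"
gsutil cp "./${SEGMENT}.tar.gz" "${BUCKET}/"

# 3. Restart the servers so they pick the segments up again.
kubectl rollout restart statefulset pinot-server
```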

Xiang Fu

02/10/2021, 5:27 AM
cool!
I think once you have time, it's better to do an idealstate dump and check whether all segments match the GCS tar'ed segments
so we will know if there are any existing segments missing
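This consistency check could be sketched as follows, assuming hypothetical controller/bucket names, that `jq` is available, and that segment tarballs are stored flat in the bucket; the exact JSON shape of the idealstate response varies by version, so the `jq` filter may need adjusting:

```shell
# Hypothetical controller host, table, and bucket names.
CONTROLLER="http://localhost:9000"
TABLE="myTable_REALTIME"
BUCKET="gs://my-pinot-deepstore/myTable"

# Segment names from the ideal state (adjust the jq path to your version).
curl -s "${CONTROLLER}/tables/${TABLE}/idealstate" \
  | jq -r '.REALTIME | keys[]' | sort > idealstate_segments.txt

# Segment tarballs present in GCS.
gsutil ls "${BUCKET}/" | xargs -n1 basename \
  | sed 's/\.tar\.gz$//' | sort > gcs_segments.txt

# Segments in the ideal state but missing from GCS.
comm -23 idealstate_segments.txt gcs_segments.txt
```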

Elon

02/10/2021, 5:29 AM
thanks, will do

Xiang Fu

02/10/2021, 9:31 AM
we may also extend the validation manager to validate idealstates and the corresponding segment deepstore locations, to ensure the segments exist