
Elon

02/10/2021, 1:45 AM
Apologies for all the trouble today: we noticed that some tables are in a "bad" state (cluster manager UI). It looks like it's due to an attempt by servers to download non-existent segments from the deepstore. Could it be that the segments were empty and never copied to the deepstore?
Should I just delete the segments to restore the idealState to good? Or could this be an issue with SegmentDeletionManager?
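For reference, a sketch of the segment-deletion API this question refers to, assuming a hypothetical controller host and table/segment names (the real names are redacted in this thread; verify the exact path against your version's Swagger UI):

```shell
# Hypothetical controller host and names for illustration only.
CONTROLLER="http://localhost:9000"
TABLE="myTable"
SEGMENT="myTable__0__96__20201216T0642Z"

# Delete one segment: it is removed from the ideal state, and the
# SegmentDeletionManager then moves its data to a deleted-segments area.
curl -s -X DELETE "${CONTROLLER}/segments/${TABLE}/${SEGMENT}"
```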
Also getting messages like this:
2021/02/10 01:51:21.689 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 5 segments with no online replicas
2021/02/10 01:51:21.689 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 0 replicas, below replication threshold :3
2021/02/10 01:51:21.796 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 1 replicas, below replication threshold :3
2021/02/10 01:51:21.815 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 2 replicas, below replication threshold :3
2021/02/10 01:51:21.877 WARN [SegmentStatusChecker] [pool-8-thread-3] Table XXX has 1 replicas, below replication threshold :3
Segments appear to be on a server, but the replication factor is below the desired value and the segments are not in the deepstore. Is there any way to get the replication factor back to 3 and save whatever is missing from the deepstore to the deepstore?
I know what happened: I mistakenly scaled down the server statefulset, and the tenants these tables were on were down for days. The newer segments are OK. Should I delete the segments marked "BAD"?
I have a copy in gcs - is there a way I can move them to the deepstore directory to download?
I see that the segments exist on servers but still get "bad" for the segment status in cluster manager:
I copied a segment from the server to deepstore (tgz'd) and tried the reloadSegment api but it got a failure message
2021/02/10 00:52:18.810 WARN [integrations_operation_store_failure_stat_REALTIME-RealtimeTableDataManager] [HelixTaskExecutor-message_handle_thread] Failed to download segment integrations_operation_store_failure_stat__0__96__20201216T0642Z from deep store:
tldr: I see that replicas per partition == 3 and the ideal state appears to be good for newer segments, but the cluster manager page (controller UI) shows the segment in a "bad" state, and the servers it's listed on (gif above) are not accurate
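A sketch of how one might inspect the ideal-state vs. actual-state mismatch described here, assuming a controller at a hypothetical host and a hypothetical table name:

```shell
# Hypothetical controller host and table name; real names are redacted above.
CONTROLLER="http://localhost:9000"
TABLE="myTable_REALTIME"

# Ideal state: which servers *should* host each segment.
curl -s "${CONTROLLER}/tables/${TABLE}/idealstate"

# External view: which servers *actually* report each segment ONLINE.
# A segment shows as "bad" in the UI when these two views disagree.
curl -s "${CONTROLLER}/tables/${TABLE}/externalview"
```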

Xiang Fu

02/10/2021, 4:30 AM
hmm, were those segments missing from the beginning, or were they deleted before they were purged?

Elon

02/10/2021, 4:31 AM
Not sure, but I was able to copy the segment from a server it was on (only 1) to GCS; then I upgraded to pinot 6 and that segment was suddenly visible
Trying to copy the rest to GCS - but not sure how to reload. Do I just run a rebalance on the table?

Xiang Fu

02/10/2021, 4:32 AM
or just force reload the table
maybe disable then enable the table
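The two suggestions above can be sketched with the controller REST API, again assuming hypothetical host and table names (in releases of this era the state toggle was a GET request, as confirmed below; newer releases use a POST to `/tables/{table}/state` instead):

```shell
# Hypothetical controller host and table name.
CONTROLLER="http://localhost:9000"
TABLE="myTable"

# Disable, then re-enable, the realtime table (GET-based state toggle).
curl -s "${CONTROLLER}/tables/${TABLE}?state=disable&type=realtime"
curl -s "${CONTROLLER}/tables/${TABLE}?state=enable&type=realtime"

# Force a reload of all segments in the table.
curl -s -X POST "${CONTROLLER}/segments/${TABLE}/reload?type=realtime"
```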

Elon

02/10/2021, 4:32 AM
Is that the "reload all segments" api?
for force reloading?
also, how to disable and enable the table?
Ah, the enable/disable is the GET request, right?

Xiang Fu

02/10/2021, 4:34 AM
you can check the swagger API
yes

Elon

02/10/2021, 4:35 AM
thanks a lot!
Trying this now
This worked @Xiang Fu - I tried reloading an individual segment and it didn't work, but when I restarted all the servers they all came online
All good!
So I found the server each segment was on, tgz'd it, copied it to GCS (from the server pod), then once done restarted all the servers
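The manual recovery described here might look roughly like this, assuming a Kubernetes deployment and hypothetical pod, path, bucket, and segment names:

```shell
# Hypothetical names for illustration; adjust to your deployment.
POD="pinot-server-0"
SEGMENT="myTable__0__96__20201216T0642Z"
DATA_DIR="/var/pinot/server/data/index/myTable_REALTIME"
BUCKET="gs://my-pinot-deepstore/myTable"

# 1. Tar the segment directory inside the server pod.
kubectl exec "$POD" -- tar -czf "/tmp/${SEGMENT}.tar.gz" -C "$DATA_DIR" "$SEGMENT"

# 2. Copy it out of the pod and up to GCS (the deepstore bucket).
kubectl cp "${POD}:/tmp/${SEGMENT}.tar.gz" "./${SEGMENT}.tar.gz"
gsutil cp "./${SEGMENT}.tar.gz" "${BUCKET}/"

# 3. Restart the servers so they pick the segments up again.
kubectl rollout restart statefulset pinot-server
```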

Xiang Fu

02/10/2021, 5:27 AM
cool!
I think once you have time, it's better to do an idealstate dump and check whether all segments match the GCS tar'ed segments
so we will know if there are any existing segments missing
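This consistency check could be sketched as follows, assuming hypothetical controller/bucket names, that `jq` is available, and that segment tarballs are stored flat in the bucket; the exact JSON shape of the idealstate response varies by version, so the `jq` filter may need adjusting:

```shell
# Hypothetical controller host, table, and bucket names.
CONTROLLER="http://localhost:9000"
TABLE="myTable_REALTIME"
BUCKET="gs://my-pinot-deepstore/myTable"

# Segment names from the ideal state (adjust the jq path to your version).
curl -s "${CONTROLLER}/tables/${TABLE}/idealstate" \
  | jq -r '.REALTIME | keys[]' | sort > idealstate_segments.txt

# Segment tarballs present in GCS.
gsutil ls "${BUCKET}/" | xargs -n1 basename \
  | sed 's/\.tar\.gz$//' | sort > gcs_segments.txt

# Segments in the ideal state but missing from GCS.
comm -23 idealstate_segments.txt gcs_segments.txt
```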

Elon

02/10/2021, 5:29 AM
thanks, will do

Xiang Fu

02/10/2021, 9:31 AM
we may also extend the validation manager to validate idealstates and the corresponding segment deepstore locations, to ensure the segments exist