
vmarchaud

02/08/2021, 4:26 PM
Hey, quick question I couldn't find anything about in the docs: I have a realtime table with a consuming segment, and I would like to stop consumption and save the segment to deep storage without creating a new consuming segment. My use case is simply to be able to stop ingesting new events so I can do tasks like updating the server or other maintenance. Thanks!

Tanmay Movva

02/08/2021, 4:35 PM
If you are not playing around with topic offsets, then Disable Table -> maintenance -> Enable Table should help you. Be aware that disabling a table also makes it unavailable for querying.
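For reference, a minimal sketch of driving that Disable Table -> maintenance -> Enable Table sequence from Java. The controller address, endpoint path, and query parameters here are assumptions; check your controller's Swagger UI for the exact API in your Pinot version.
```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TableMaintenance {
  // Assumed controller address and table-state endpoint; verify both against
  // the controller Swagger UI before relying on this sketch.
  private static final String CONTROLLER = "http://localhost:9000";

  static void setTableState(String tableName, String state) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(CONTROLLER + "/tables/" + tableName + "/state?state=" + state + "&type=realtime"))
        .PUT(HttpRequest.BodyPublishers.noBody())
        .build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(state + " -> " + response.statusCode() + ": " + response.body());
  }

  public static void main(String[] args) throws Exception {
    setTableState("myTable", "disable"); // consumption stops; table is not queryable while disabled
    // ... perform the server upgrade / maintenance here ...
    setTableState("myTable", "enable");  // consumption resumes from the last committed offset
  }
}
```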

Kishore G

02/08/2021, 4:37 PM
We have talked about adding pause/unpause operations multiple times... we have yet to agree on a safe way to achieve this. @Subbu Subramaniam ^^

vmarchaud

02/08/2021, 4:44 PM
Sadly, I am playing around with topic offsets, because I use GCP Pub/Sub (which doesn't have any notion of partitions/offsets). I wrote a plugin that essentially fakes offsets to be able to use LLC segments.

Subbu Subramaniam

02/08/2021, 5:02 PM
@vmarchaud you should be able to update the server even as rows are getting consumed, since consumption will resume from where it left off before maintenance.

vmarchaud

02/08/2021, 5:05 PM
@Subbu Subramaniam That's not what I observe: the consuming segment isn't fsync'ed to disk, so when the server restarts the segment is empty.
And I think that's what is described here: https://docs.pinot.apache.org/developers/advanced/data-ingestion#ingesting-realtime-data ("In either mode, Pinot servers store the ingested rows in volatile memory until either one of the following conditions are met:")

Subbu Subramaniam

02/08/2021, 5:06 PM
On a restart it will start consuming from where the previous segment completed. There is no fsync going on at any time.

vmarchaud

02/08/2021, 5:08 PM
So if I understand correctly, you are saying that when shutting down the server, it should complete the segment?

Subbu Subramaniam

02/08/2021, 5:17 PM
No. Let me try with an example. Assume we have messages in offsets 4, 8, 12, ... in partition 13 of a stream. The server consumed until offset 48 and completed segment 33, and saved it in segment store. It starts to consume from offset 52 for segment 34. Let us say it consumes offsets 52 and 56, and the server is restarted. The messages consumed for 52 and 56 (from volatile memory) will be discarded. When the server comes back, it will start consuming from 52 again. Hope this helps.
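A toy model of the restart behavior described above (not Pinot code), assuming a Kafka-like stream whose offsets can be replayed: only the offset at which the last segment completed is persisted, so the in-memory rows of the consuming segment are simply discarded and re-consumed after a restart.
```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCheckpointDemo {
  // Persisted when segment 33 completed: the next offset to consume from.
  static long committedStartOffset = 52;
  // Rows of the consuming segment 34, held in volatile memory only.
  static List<Long> inMemoryRows = new ArrayList<>();

  static void consume(long offset) {
    inMemoryRows.add(offset);
  }

  static void restart() {
    inMemoryRows.clear(); // rows consumed at offsets 52 and 56 are discarded
    System.out.println("Resuming consumption from offset " + committedStartOffset);
  }

  public static void main(String[] args) {
    consume(52);
    consume(56);
    restart(); // prints: Resuming consumption from offset 52
  }
}
```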

vmarchaud

02/08/2021, 5:19 PM
Okay, got it, that was my understanding for a system like Kafka. However, as I said above, we "fake" offsets because GCP Pub/Sub doesn't implement them, so it doesn't work for us. But thanks for the answers.

Subbu Subramaniam

02/08/2021, 5:37 PM
While a primitive like "commit segment now, and then stop further consumption" may help some, a few things are still unclear to me. (1) How is this sustainable in a production environment, when servers may be restarted at any time? (2) How do you "fake" offsets? Continuing from my previous example: suppose that after consuming offset 52, Pinot gets a command to commit now and hold its peace. Pinot creates a segment ending at offset 52, sets the next offset to consume from as 56, but disables consumption. If, after maintenance, your offsets are not valid anymore, will it still not be a problem?

vmarchaud

02/08/2021, 5:43 PM
1. I would guess this would only be useful for maintenance, but you are right that offsets fix this problem (except maybe when pausing across multiple restarts).
2. Clearly that's a hack (if you are interested, the code is here: https://github.com/reelevant-tech/pinot-pubsub-plugin/blob/v2.0.0/src/main/java/co[…]t/pinot/plugins/stream/pubsub/PubSubPartitionLevelConsumer.java). In my case that would not matter, because the plugin reads the latest messages (since there is no concept of an offset to read from).
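For context, a simplified sketch of the "fake offset" pattern being discussed (an assumption-laden illustration, not the actual plugin code linked above): the consumer ignores the start offset it is asked to read from and just returns whatever the subscription delivers next, so a message can never be fetched a second time after a restart.
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class FakeOffsetConsumer {
  // Stand-in for a GCP Pub/Sub subscription: messages arrive, and once pulled
  // (and acked) they cannot be re-read from an arbitrary position.
  private final LinkedBlockingQueue<String> subscription = new LinkedBlockingQueue<>();
  private long syntheticOffset = 0;

  List<String> fetchMessages(long requestedStartOffset) {
    // requestedStartOffset is disregarded: there is no way to seek in the stream.
    List<String> batch = new ArrayList<>();
    subscription.drainTo(batch);
    syntheticOffset += batch.size(); // "fake" offsets only ever move forward
    return batch; // once returned here, these rows cannot be consumed again
  }
}
```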

Subbu Subramaniam

02/08/2021, 5:50 PM
So, if I understand the implementation right, you disregard the offset passed in and just get the next set of messages. In other words, the "offset" is maintained by the stream partition, and it does not provide a way to consume a message multiple times. Am I right? In that case, the way to phrase the problem is to support a stream like this. While the solution you have (I agree, a hack) is a good demo, any reasonable installation will require clear handling of the case when a server's power is pulled. Can you file an issue to support streams like this? (It would help if you point to this aspect and any other differences you know of off-hand.) Clearly, having pause/restart will not help in the larger case. Do you agree?

vmarchaud

02/08/2021, 5:55 PM
Totally agree with you. I think we'll need to transition to a different pub/sub system to handle this correctly (fortunately, we were expecting to do this anyway).
I guess I could open a ticket so people looking to integrate with the GCP Pub/Sub system will have all the information needed (and maybe someday Pinot will handle it in some way).
👍 2