# general
a
A simple question: does it make sense to keep segment retention at 4 years in the case of realtime tables? Should we go for a hybrid table? What is the recommendation? Or does this decision depend on other factors? In our case, we will always be ingesting data via Kafka.
k
Hi Apoorva, can you explain your use case if possible?
a
Sure. We have a schema that contains the payment method used by the user. On average, users shop 3 times a year. To give the user a personalized experience based on the last payment method used, we want to fetch the past 10 payment methods. Needing those 10 payment methods is what pushes up the retention requirement for the segments.
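For illustration, the lookup this use case implies might look like the sketch below; the table and column names (`payments`, `userId`, `paymentMethod`, `eventTime`) are hypothetical, not taken from this thread:

```sql
-- Hypothetical Pinot query: the 10 most recent payment methods for one user.
SELECT paymentMethod, eventTime
FROM payments
WHERE userId = 'user-123'
ORDER BY eventTime DESC
LIMIT 10
```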
k
Got it. Follow-up question: what do you plan to do in the case where the user hasn't shown up for a few days/months?
I am asking this to determine whether it makes sense to store all the past user data in a single row, since it is sparse.
a
Nothing as such.
k
@Xiang Fu I think long retention can be enabled in this case. What are your thoughts?
m
There are a couple of advantages to having a hybrid table. More pre-aggregation can happen offline, so you can get better performance. Your Kafka cluster may not provide such long retention, so if you ever have to re-bootstrap the table, that won't be possible from Kafka alone. Also, backfills are not possible in realtime-only tables.
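For context, a hybrid table in Pinot is not a separate table type: it is an OFFLINE table config and a REALTIME table config that share the same table name, and the broker merges results across the two at query time. A minimal sketch of the pair, trimmed to the distinguishing fields (`payments` is a hypothetical name; a real config also needs schema, stream, and indexing sections):

```json
{"tableName": "payments", "tableType": "REALTIME"}

{"tableName": "payments", "tableType": "OFFLINE"}
```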
x
I would suggest having a hybrid table if we foresee data rollups/backfills happening for the table. Otherwise, it doesn't matter too much.
a
In our use case there is no backfill or pre-aggregation. If it doesn't matter, then we will go with the 4-year retention period.
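Concretely, a 4-year retention would be expressed in the realtime table's `segmentsConfig`. A minimal sketch, with a hypothetical table name and all other required sections omitted:

```json
{
  "tableName": "payments",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "1460"
  }
}
```

Pinot's retention manager then periodically deletes segments whose time range falls entirely outside this window.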
s
@Apoorva Moghey How many segments do you create per day? (Leading to:) What do you expect the number of segments to be in 4 years? Are you OK with the latency of searching that many segments? You may want to size your segments reasonably if you are going for a 4-year retention. I have modified the RealtimeProvisioningHelper tool (and also documented it in the latest docs). You may want to give it a spin to see what your hardware sizing looks like and how many segments will be searched for your queries.
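For reference, an invocation of the tool looks roughly like the sketch below. All values are made-up placeholders, and the exact flag set can vary by Pinot version, so treat the current documentation as authoritative:

```sh
# Estimate per-host memory and optimal segment size for a long realtime retention.
# Flag values here are illustrative placeholders, not recommendations.
bin/pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile /path/to/payments-realtime-table-config.json \
  -numPartitions 8 \
  -numHosts 4,6,8 \
  -numHours 6,12,18,24 \
  -sampleCompletedSegmentDir /path/to/a/completed/sample/segment \
  -ingestionRate 1000 \
  -maxUsableHostMemory 48G \
  -retentionHours 35040
```

For each (numHosts, numHours) combination the tool prints estimates such as memory used per host and optimal segment size, which feeds directly into the sizing discussion below.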
a
> How many segments do you create per day? (Leading to:) What do you expect the number of segments to be in 4 years?
I don't have these numbers as of now; we just started exploring this new use case. Since this is a customer-facing use case, we are not OK with high latency. We will definitely check RealtimeProvisioningHelper. As I understand it, the higher the number of segments, the higher the latency, so we need to decide our segment size carefully.
s
The number of rows in a segment and the overall number of segments should be balanced. A segment is nothing but a shard. If you have too many small segments, the per-segment overhead goes up, so it is better to combine them into a smaller number of larger segments. On the other hand, a segment is processed in one thread, so having too large a segment is not good for latency either. It is a balancing game, and you need to tune it.
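To make the trade-off concrete with purely illustrative numbers (none of these come from the thread):

```
50M rows/day ÷ 5M rows/segment  ≈ 10 segments/day
10 segments/day × 365 × 4 years ≈ 14,600 segments at full retention
```

Doubling the target segment size halves the segment count and the per-segment overhead, but each segment then takes roughly twice as long to scan in its single thread, so there is a sweet spot rather than a single right answer.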
a
Understood.
s
That said, we are working on features to merge segments, so some of that can be useful in your use case. If your segment store has good redundancy, then the short Kafka retention is less of a concern for your use case. Otherwise, I would prefer a hybrid table for the backup it provides in terms of source data. From what you mention (a user transacts a few times a year), it seems like you may not get much aggregation in offline segments, but correct me if I am wrong.
a
Yes, it is true: a user transacts on average 3 times a year. Nor do we have any offline aggregation.
s
If you don't have sample data, you cannot make use of RealtimeProvisioningHelper. You may want to simulate sample data for a few days and use that to run the tool.
a
We have sample data showing what the data will look like; we just need to simulate with RealtimeProvisioningHelper.
s
Please read through the docs and use the currently checked-in version. I can answer any questions if you like.
a
Sure @Subbu Subramaniam