# general
a
A simple question: does it make sense to keep segment retention at 4 years in the case of realtime tables? Should we go for a hybrid table? What is the recommendation? Or does this decision depend on other factors? In our case, we will always be ingesting data via Kafka.
k
Hi Apoorva, can you explain your use case if possible?
a
Sure. We have a schema that contains the payment method used by the user. On average, users shop 3 times a year. To give the user a personalized experience based on the last payment method used, we want to fetch the past 10 payment methods. Needing those 10 payment methods is what pushes up the retention requirement for the segments.
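For illustration, the lookup this use case implies might look like the sketch below; the table and column names (`payments`, `userId`, `paymentMethod`, `eventTime`) are hypothetical, not taken from this thread:

```sql
-- Hypothetical Pinot query: the 10 most recent payment methods for one user.
SELECT paymentMethod, eventTime
FROM payments
WHERE userId = 'user-123'
ORDER BY eventTime DESC
LIMIT 10
```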
k
Got it. Follow-up question: what do you plan to do in the case where the user hasn't shown up for a few days/months?
I am asking this to determine whether it makes sense to store all the past user data in a single row, since it is sparse.
a
Nothing as such.
k
@Xiang Fu I think long retention can be enabled in this case. What are your thoughts?
m
There are a couple of advantages to having a hybrid table. More pre-aggregation can happen offline, so you can get better performance. Your Kafka cluster may not provide such long retention, so if you ever have to re-bootstrap the table, that won't be possible from Kafka alone. Also, backfills are not possible in realtime-only tables.
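For context, a hybrid table in Pinot is not a separate table type: it is an OFFLINE table config and a REALTIME table config that share the same table name, and the broker merges results across the two at query time. A minimal sketch of the pair, trimmed to the distinguishing fields (`payments` is a hypothetical name; a real config also needs schema, stream, and indexing sections):

```json
{"tableName": "payments", "tableType": "REALTIME"}

{"tableName": "payments", "tableType": "OFFLINE"}
```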
x
I would suggest having a hybrid table if we foresee data rollups/backfills happening for the table. Otherwise, it doesn't matter too much.
a
In our use case there is no backfill or pre-aggregation. If it doesn't matter, then we will go with the 4-year retention period.
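Concretely, a 4-year retention would be expressed in the realtime table's `segmentsConfig`. A minimal sketch, with a hypothetical table name and all other required sections omitted:

```json
{
  "tableName": "payments",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "1460"
  }
}
```

Pinot's retention manager then periodically deletes segments whose time range falls entirely outside this window.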
s
@Apoorva Moghey How many segments do you create per day? (Leading to:) What do you expect the number of segments to be in 4 years? Are you OK with the latency of searching that many segments? You may want to size your segments reasonably if you are going for a 4-year retention. I have modified the RealtimeProvisioningHelper tool (and also documented it in the latest docs). You may want to give it a spin to see what your hardware sizing looks like and how many segments will be searched for your queries.
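For reference, an invocation of the tool looks roughly like the sketch below. All values are made-up placeholders, and the exact flag set can vary by Pinot version, so treat the current documentation as authoritative:

```sh
# Estimate per-host memory and optimal segment size for a long realtime retention.
# Flag values here are illustrative placeholders, not recommendations.
bin/pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile /path/to/payments-realtime-table-config.json \
  -numPartitions 8 \
  -numHosts 4,6,8 \
  -numHours 6,12,18,24 \
  -sampleCompletedSegmentDir /path/to/a/completed/sample/segment \
  -ingestionRate 1000 \
  -maxUsableHostMemory 48G \
  -retentionHours 35040
```

For each (numHosts, numHours) combination the tool prints estimates such as memory used per host and optimal segment size, which feeds directly into the sizing discussion below.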
a
> How many segments do you create per day? (Leading to:) What do you expect the number of segments to be in 4 years?
I don't have these numbers as of now; we just started exploring this new use case. Since this is a customer-facing use case, we are not OK with high latency. We will definitely check RealtimeProvisioningHelper. As I understand it, the higher the number of segments, the higher the latency, so we need to decide our segment size carefully.
s
The number of rows in a segment and the overall number of segments should be balanced. A segment is nothing but a shard. If you have too many small segments, the per-segment overhead goes up, so it is better to combine them into a smaller number of larger segments. On the other hand, a segment is processed in one thread, so having too large a segment is not good for latency either. It is a balancing game, and you need to tune it.
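To make the trade-off concrete with purely illustrative numbers (none of these come from the thread):

```
50M rows/day ÷ 5M rows/segment  ≈ 10 segments/day
10 segments/day × 365 × 4 years ≈ 14,600 segments at full retention
```

Doubling the target segment size halves the segment count and the per-segment overhead, but each segment then takes roughly twice as long to scan in its single thread, so there is a sweet spot rather than a single right answer.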
a
Understood.
s
That said, we are working on features to merge segments, so some of that can be useful in your use case. If your segment store has good redundancy, then the short Kafka retention is less of a concern for your use case. Otherwise, I would prefer a hybrid table for the backup it provides in terms of source data. From what you mention (a user transacts a few times a year), it seems like you may not get much aggregation in offline segments, but correct me if I am wrong.
a
Yes, it is true: a user transacts on average 3 times a year. Nor do we have any offline aggregation.
s
If you don't have sample data, you cannot make use of RealtimeProvisioningHelper. You may want to simulate sample data for a few days and use that to run the tool.
a
We have sample data showing what the data will look like; we just need to simulate with RealtimeProvisioningHelper.
s
Please read through the docs and use the currently checked-in version. I can answer any questions if you like.
a
Sure @Subbu Subramaniam