# getting-started
l
Hey friends, I was reading this doc about Pinot managed flows, and I see that the general recommendation for Pinot is to have a hybrid table. For our use case we are planning a retention of 2 years, and so far we have only run Pinot on a realtime setup. I definitely see why we should use offline tables as well as realtime ones and move data from realtime to offline once a certain time threshold has been met, but I have several questions:
1. Once in production, how can we move all the completed segments from realtime tables to offline ones (given that our app has already been ingesting data for some time)?
2. Is a 2 year retention for a realtime table just too much?
3. Should we expect a performance hit from queries that hit the offline tables?
4. Are indexing and partitioning also available on an offline setup?
5. Do you see benefits in terms of storage once you move data to offline tables?
just bumping the above
m
Oh apologies for missing it, reading.
1. We should be able to set up the realtime-to-offline job on an existing table (will confirm).
2. Yeah typically you want to have a small retention for realtime servers.
3. No perf hit, but curious why you'd think that?
4. Yes
5. If there's scope to merge and rollup then yes.
Can confirm 1.
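For answer 5, here is a minimal sketch of what "merge and rollup" could look like in the task section of the realtime table config. It assumes the `RealtimeToOfflineSegmentsTask` options `mergeType` and `<metric>.aggregationType` behave as described in the managed offline flows doc; the metric column `clicks` is hypothetical, so check the option names against your Pinot version before using this.

```python
import json

# Hypothetical metric column; replace with a real metric from your schema.
METRIC = "clicks"

# Task section of a REALTIME table config (a fragment, not a full table config).
rollup_task = {
    "taskTypeConfigsMap": {
        "RealtimeToOfflineSegmentsTask": {
            # Assumption: "rollup" aggregates rows that share the same
            # dimension and time values while segments are moved offline.
            "mergeType": "rollup",
            # Assumption: per-metric aggregation used during the rollup.
            f"{METRIC}.aggregationType": "sum",
        }
    }
}

print(json.dumps({"task": rollup_task}, indent=2))
```

Storage only shrinks if many rows actually share the same dimension and time values, which is the "if there's scope" part of the answer.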
l
Thank you! For your answer to number 1: is that what is shown in the Pinot managed flows doc? Is it basically that setup, or is it something else? If so, do you have an example so that we could implement it?
For number 2: we want to keep data for only 2 years overall, so how do you come up with a feasible retention for the realtime tables?
For number 3: I was just wondering, because I was not sure whether partitioning and all worked the same way, given that the data being moved wouldn't be coming from Kafka anymore, and I thought that the partitioning in Kafka was correlated with how things get partitioned in Pinot as well.
My offline table setup would be similar to the realtime one, but I guess the only difference is that the offline table would have different retention policies and would not be consuming from a stream, right?
m
1. Yes, it is just that setup. One thing to note is that if you have a lot of data to move, you may want to start with a more frequent job and then move to a less frequent schedule.
2. Once you also have an offline table, you should keep the retention of the realtime table small (say 5 days or so), so that most data is served from offline.
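The two points above can be sketched as a pair of table config fragments. This is only an illustration: the table name `events`, the 1 day bucket, and the 2 day buffer are assumptions, while the 5 day realtime retention and the 2 year total come from this thread. A real table config needs more fields (time column, replication, indexing, stream settings), and how often the task runs is configured separately and is not shown here.

```python
import json

TABLE = "events"  # hypothetical table name


def hybrid_config_fragments(realtime_days=5, total_days=730):
    """Return (realtime, offline) table config fragments for a hybrid table."""
    realtime = {
        "tableName": TABLE,
        "tableType": "REALTIME",
        "segmentsConfig": {
            "retentionTimeUnit": "DAYS",
            "retentionTimeValue": str(realtime_days),  # short, e.g. 5 days
        },
        "task": {
            "taskTypeConfigsMap": {
                "RealtimeToOfflineSegmentsTask": {
                    # Illustrative values: each run moves one 1-day window,
                    # and only windows older than 2 days are moved.
                    "bucketTimePeriod": "1d",
                    "bufferTimePeriod": "2d",
                }
            }
        },
    }
    offline = {
        "tableName": TABLE,
        "tableType": "OFFLINE",
        "segmentsConfig": {
            "retentionTimeUnit": "DAYS",
            "retentionTimeValue": str(total_days),  # about 2 years overall
        },
    }
    return realtime, offline


realtime_cfg, offline_cfg = hybrid_config_fragments()
print(json.dumps(realtime_cfg, indent=2))
print(json.dumps(offline_cfg, indent=2))
```

The buffer should stay well below the realtime retention, so that data is moved to the offline table before the realtime copy is deleted.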
Yes, setup will be similar. Other than what you mentioned, you probably want to use separate server tags for offline tables, so that offline and realtime tables are not colocated.
It is not a requirement, just a recommendation.
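To illustrate the server tag recommendation, here is a small sketch of how separate tags keep offline and realtime tables on different servers. It assumes Pinot's convention of deriving server tags as `<tenant>_OFFLINE` and `<tenant>_REALTIME` from the table's server tenant; the tenant name `myTenant` and the instance names `Server_1` and `Server_2` are hypothetical.

```python
TENANT = "myTenant"  # hypothetical server tenant name


def server_tag(tenant, table_type):
    # Assumed convention: "<tenant>_OFFLINE" or "<tenant>_REALTIME".
    return f"{tenant}_{table_type}"


# Hypothetical two-server cluster, one tag per server (no colocation).
server_tags = {
    "Server_1": [server_tag(TENANT, "REALTIME")],
    "Server_2": [server_tag(TENANT, "OFFLINE")],
}


def eligible_servers(table_type):
    """Servers that may host segments of the given table type."""
    wanted = server_tag(TENANT, table_type)
    return sorted(name for name, tags in server_tags.items() if wanted in tags)


print("realtime ->", eligible_servers("REALTIME"))
print("offline  ->", eligible_servers("OFFLINE"))
```

Giving a server both tags would colocate the two table types on it, which is the setup the recommendation tries to avoid.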
l
Hey Mayank, thank you so much for your thoughtful responses. One other question I had: how much do I get from the colocation, say if I only have 2 servers (so offline tables and realtime tables are not colocated)? Also, when you said 1. is about setup, you meant the https://docs.pinot.apache.org/operators/operating-pinot/pinot-managed-offline-flows config, yeah?
m
Not sure if I followed "How much do I get from the colocation"? Ideally you'd like to avoid colocation, as the workload footprint of the two is somewhat different (due to consumption).
l
I have 2 servers, so my data would only have 2 places to really live, either server 1 or server 2. So avoiding colocation means realtime segments wouldn't live on the same server as offline ones, yes?
m
Yes, with two servers it will live on one or the other, unless you colocate. The reason I recommend against colocation is, like I mentioned, that realtime servers are also consuming, so you want them to serve less data from the same node.
l
I see, that makes sense to me. Thank you very much, Mayank!