Apache Pinot #getting-started

Luis Fernandez

02/02/2022, 9:46 PM

Hey friends, I was reading this doc about Pinot managed flows, and I see the general recommendation for Pinot is to have a hybrid table. For our use case we are planning a retention of 2 years, and so far we have only run Pinot in a realtime setup. I can see why we should use offline tables as well as realtime ones and move data from realtime to offline once a certain time threshold has been met, but I have several questions:
1. Once in production, how can we move all the completed segments from realtime tables to offline ones (given that our app has already been ingesting data for some time)?
2. Is a 2 year retention for a realtime table just too much?
3. Should we expect a performance hit for queries that hit the offline tables?
4. Are indexing and partitioning also available in an offline setup?
5. Do you see storage benefits once you move data to offline tables?

Luis Fernandez

02/03/2022, 8:49 PM

just bumping the above

Mayank

02/03/2022, 8:56 PM

Oh apologies for missing it, reading.

Mayank

02/03/2022, 8:58 PM

1. We should be able to set up a realtime-to-offline job on an existing table (will confirm).
2. Yeah typically you want to have a small retention for realtime servers.
3. No perf hit, but curious why you'd think that?
4. Yes
5. If there's scope to merge and rollup then yes.

Mayank

02/03/2022, 9:03 PM

Can confirm 1.
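
For readers looking for the setup behind answer 1: as far as the Pinot managed offline flows documentation describes it, the job is enabled by adding a `RealtimeToOfflineSegmentsTask` entry to the task config of the realtime table. The sketch below builds such a fragment in Python and prints it as JSON. The table name and the period values are placeholders chosen for illustration, not recommendations from this thread, and a matching offline table is assumed to exist already.

```python
import json

# Hypothetical realtime table config fragment. Only the parts relevant to the
# realtime-to-offline job are shown; a real table config needs many more fields.
table_config_fragment = {
    "tableName": "myTable_REALTIME",  # placeholder table name
    "tableType": "REALTIME",
    "task": {
        "taskTypeConfigsMap": {
            "RealtimeToOfflineSegmentsTask": {
                # Size of the time window moved per run (placeholder value).
                "bucketTimePeriod": "1d",
                # Only data older than this is considered for moving (placeholder value).
                "bufferTimePeriod": "2d",
                # "concat" keeps rows as they are; "rollup" is the option that
                # relates to the merge-and-rollup storage savings in answer 5.
                "mergeType": "concat",
            }
        }
    },
}

print(json.dumps(table_config_fragment, indent=2))
```

The fragment would be merged into the existing realtime table config; the cluster also needs a minion and the controller task scheduler enabled for the task to run.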

Luis Fernandez

02/03/2022, 9:30 PM

Thank you! For your answer to number 1: is that what is shown in the Pinot managed flows doc? Is it basically that setup or something else? If so, do you have an example we could implement?
For number 2: we want to keep data for only 2 years overall, so how do you come up with a feasible retention for the realtime tables?
For number 3: I was just wondering because I wasn't sure partitioning and everything else worked the same way, given that the data being moved wouldn't be coming from Kafka anymore, and I thought the Kafka partitioning was correlated with how things get partitioned in Pinot.
Also, my offline table setup would be similar to the realtime one, and I guess the only differences are that the offline table would have different retention policies and would not be consuming from a realtime stream, right?

Mayank

02/04/2022, 6:39 PM

1. It is just setup. One thing to note is that if you have too much data, you may start with a more frequent job, and then move to a less frequent schedule.
2. Once you have offline table also, you should keep retention of realtime to be small (say 5 days or so), so most data will be served from offline.
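
Both points above can be made concrete with a small sketch. The first part shows the retention split as `segmentsConfig` fragments (5 days for realtime as suggested, and 730 days for offline as a stand-in for the 2 year requirement). The second part is back-of-the-envelope arithmetic for why a more frequent job helps with a backlog. It assumes each run moves one bucket of data, which is my reading of how the task works rather than something stated in this thread, and the helper name `catch_up_hours` is made up for illustration.

```python
import math

# Retention fragments (table names omitted; values are illustrative).
realtime_segments_config = {"retentionTimeUnit": "DAYS", "retentionTimeValue": "5"}
offline_segments_config = {"retentionTimeUnit": "DAYS", "retentionTimeValue": "730"}


def catch_up_hours(backlog_days, bucket_days, run_interval_hours):
    """Estimate wall-clock hours needed to drain a backlog of realtime data.

    Assumes one bucket is moved per run, while new data keeps arriving
    at a rate of one day of data per day.
    """
    new_data_per_run_days = run_interval_hours / 24.0
    net_days_per_run = bucket_days - new_data_per_run_days
    if net_days_per_run <= 0:
        # Each run moves no more than what arrived since the last run.
        return math.inf
    runs = math.ceil(backlog_days / net_days_per_run)
    return runs * run_interval_hours


# 180 days of backlog, 1-day buckets, job running every hour.
print(catch_up_hours(180, 1, 1))
# Same backlog with a daily job: the backlog is never drained.
print(catch_up_hours(180, 1, 24))
```

Under these assumptions an hourly job drains the backlog in about 8 days, while a daily job only keeps pace with new data, which is one way to read the advice to start frequent and relax the schedule later.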

Mayank

02/04/2022, 6:40 PM

Yes, setup will be similar. Other than what you mentioned, you probably want to use separate server tags for offline tables, so that offline and realtime tables are not colocated.

Mayank

02/04/2022, 6:40 PM

It is not a requirement, just a recommendation.
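
As a rough illustration of the server tag recommendation: to my understanding, a Pinot table names a server tenant in its `tenants` config, and offline and realtime tables then look for servers tagged `<tenant>_OFFLINE` and `<tenant>_REALTIME` respectively. The sketch below is a simplified model of that lookup, not Pinot's actual assignment code; the instance names and the helper `eligible_servers` are hypothetical.

```python
# Table config fragment shared by both the offline and the realtime table.
tenants_fragment = {"broker": "DefaultTenant", "server": "DefaultTenant"}

# Hypothetical two-server cluster where each server carries only one tag,
# so offline and realtime segments are not colocated.
server_tags = {
    "Server_pinot-server-0": ["DefaultTenant_REALTIME"],
    "Server_pinot-server-1": ["DefaultTenant_OFFLINE"],
}


def eligible_servers(server_tenant, table_type, tags_by_server):
    """Return the servers whose tags match the tag derived for this table type."""
    wanted_tag = f"{server_tenant}_{table_type}"
    return sorted(
        server for server, tags in tags_by_server.items() if wanted_tag in tags
    )


print(eligible_servers(tenants_fragment["server"], "REALTIME", server_tags))
print(eligible_servers(tenants_fragment["server"], "OFFLINE", server_tags))
```

If both servers carried both tags instead, each table type would be eligible for both servers, which is the colocated layout.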

Luis Fernandez

02/07/2022, 9:20 PM

Hey Mayank, thank you so much for your thoughtful responses. One other question I had: how much do I get from the colocation, say if I only have 2 servers (so offline tables and realtime tables are not colocated)? Also, when you said 1. about setup, you meant the https://docs.pinot.apache.org/operators/operating-pinot/pinot-managed-offline-flows config, yeah?

Mayank

02/07/2022, 11:25 PM

Not sure if I followed "How much do i get from the colocation"? Ideally you’d like to avoid colocation, as the workload footprint of the two is somewhat different (due to consumption).

Mayank

02/07/2022, 11:25 PM

Yes - https://docs.pinot.apache.org/operators/operating-pinot/pinot-managed-offline-flows

Luis Fernandez

02/07/2022, 11:29 PM

I have 2 servers, so would my data only have 2 places to really live, either server 1 or server 2? And avoiding colocation means realtime segments wouldn't live on the same server as offline ones, yes?

Mayank

02/07/2022, 11:46 PM

Yes, with two servers it will live in either one, unless you do colocation. The reason I recommend against colocation is, like I mentioned, that realtime servers are also consuming, so you want them to serve less data from the same node.

Luis Fernandez

02/08/2022, 5:11 PM

I see, that makes sense to me. Thank you very much, Mayank.
