Apache Pinot #general


Diogo Baeder

11/05/2021, 1:45 AM

Hi folks, I have some questions about migrating data from a previous database into Pinot: In my project, we'll start publishing data to a Pinot realtime table, but we also need to port historical data. For historical data, do you recommend using an OFFLINE table to be used in conjunction with the REALTIME table, or is it fine to port the historical data to the REALTIME table directly? What are the pros and cons for each approach? Thanks!

Mayank

11/05/2021, 2:01 AM

You could do either. Typically, folks put historical data into an OFFLINE table and have the RT (real-time) nodes ingest and serve only fresh data.
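
To make the OFFLINE plus REALTIME pairing concrete, here is a minimal Python sketch of how a query over such a hybrid table could be split between the two at a time boundary. It is a toy model, not Pinot code: the `Segment` class, the day-based time unit, and the rule "boundary = latest offline end time minus one unit" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for an offline segment's time range."""
    name: str
    start_day: int  # days since epoch (illustrative time unit)
    end_day: int


def time_boundary(offline_segments):
    # Assumption: the boundary sits one time unit before the latest
    # end time covered by the offline segments.
    return max(s.end_day for s in offline_segments) - 1


def route(day, boundary):
    # Rows at or before the boundary are answered by the OFFLINE table,
    # anything newer by the REALTIME table.
    return "OFFLINE" if day <= boundary else "REALTIME"


offline = [Segment("hist_0", 90, 100), Segment("hist_1", 100, 101)]
boundary = time_boundary(offline)
for day in (95, 100, 101, 105):
    print(day, route(day, boundary))
```

With this split, historical backfills only ever touch the offline side, while the real-time side keeps consuming fresh events.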

Diogo Baeder

11/05/2021, 12:54 PM

Ah, wait, so the RT nodes should be separate from the OL (offline) nodes? They're part of the same cluster, but have to be dedicated to either OL or RT? (Sorry for my ignorance)

Mayank

11/06/2021, 4:19 AM

Technically, you could use the same instances, but in a production setup, depending on your requirements, you may want to keep them separate.
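
One way to picture keeping them separate is to point the two table types at different server tenants. The Python sketch below builds two partial table configs as dictionaries. The table name `events` and the tenant names `historicalServers` and `freshServers` are hypothetical, and a real table config needs more sections (segment, index, and ingestion settings) than shown here.

```python
import json


def table_config(name, table_type, server_tenant, broker_tenant="DefaultTenant"):
    # Partial config only: enough to show which server tenant hosts the table.
    return {
        "tableName": name,
        "tableType": table_type,  # "OFFLINE" or "REALTIME"
        "tenants": {"broker": broker_tenant, "server": server_tenant},
    }


# Same logical table, two physical tables on different (hypothetical) tenants.
offline_cfg = table_config("events", "OFFLINE", "historicalServers")
realtime_cfg = table_config("events", "REALTIME", "freshServers")

print(json.dumps(offline_cfg, indent=2))
print(json.dumps(realtime_cfg, indent=2))
```

Using the same tenant name in both configs would model the "same instances" option mentioned above.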

Diogo Baeder

11/06/2021, 12:52 PM

Got it. And do you guys have any experience comparing the performance of storing segments on S3 vs HDFS? Which storage do you recommend for an AWS-based cluster?

Mayank

11/06/2021, 3:18 PM

If you are in AWS, then you could go with S3. Pinot doesn’t recommend one over the other; it only requires the storage system to be reliable and available.

Diogo Baeder

11/06/2021, 6:56 PM

Thanks man! And something I've been wondering is, how does it cache segments in memory, if it caches them at all? I'd like to have a better understanding about what I can expect from Pinot in terms of how it decides whether to load segments from deep storage vs reusing cached results or rows

Mayank

11/06/2021, 11:27 PM

Pinot does not cache query results, because that does not make much sense for real-time systems where data is changing. Pinot servers store a local copy of the data on attached disk and serve queries from there. The deep store is used by servers to download data in case of data loss, new servers joining the cluster, etc.
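
The "local copy first, deep store as fallback" behaviour described here can be sketched in a few lines of Python. This is a toy model and not Pinot's implementation: the function `load_segment` is hypothetical, and two plain directories stand in for the server's attached disk and the deep store.

```python
import shutil
from pathlib import Path


def load_segment(name: str, local_dir: Path, deep_store_dir: Path):
    """Return (path, source) for a segment, preferring the local copy."""
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path = local_dir / name
    if local_path.exists():
        # Normal case: serve from the copy already on attached disk.
        return local_path, "local"
    # Recovery case (data loss, new server): fetch from the deep store.
    shutil.copy(deep_store_dir / name, local_path)
    return local_path, "downloaded"
```

In this model the deep store is only read when the local copy is missing; every later call is served from local disk.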

Diogo Baeder

11/07/2021, 12:22 AM

Ah, got it! Thanks!
