#general

Diogo Baeder

12/11/2021, 6:02 PM
Hi folks, I've got a question about publishing events for Pinot realtime tables. I have tons of analytics logs backed up, and I want to send all of that to Pinot while also starting to send logs in realtime. I'm configuring my table with a flush threshold time of 24h and a desired segment size of 200M, but it's not clear how I can set up the tables so that I get a cleaner "1 day of data per segment" kind of deal. Should I perhaps use hybrid tables, publishing the old logs to the offline table and live logs to the realtime table? What do you recommend in this case, where the logs from my backups will be uploaded out of order?

Mayank

12/11/2021, 6:05 PM
If you'd like a clean 1 segment per day, you can achieve that on the offline side using a hybrid table. For my understanding, what's the reason for wanting 1 segment per day?
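For context, a Pinot hybrid table is simply a REALTIME table and an OFFLINE table that share the same table name; the broker merges query results across the time boundary between them. A minimal sketch of the offline half might look like this (the table name `analyticsLogs` and time column `timestampMs` are illustrative assumptions, not from the thread):

```json
{
  "tableName": "analyticsLogs",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "timestampMs",
    "replication": "1"
  },
  "tableIndexConfig": {},
  "tenants": {},
  "metadata": {}
}
```

The realtime half would use the same `"tableName"` with `"tableType": "REALTIME"` plus its stream configs; backfilled historical segments are then pushed to the OFFLINE table.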

Diogo Baeder

12/11/2021, 6:08 PM
Probably just an OCD thing of mine; I was thinking that 1 day per segment would be ideal, but I might be totally wrong 😄 In certain cases we might fit a full day of logs in one segment, but lately our log volume has been growing and I'm afraid we'll start hitting the 200M limit sooner than the 24h limit for committing the segment. So maybe it doesn't even make sense for me to think about "1 day per segment" at all. What do you think? What's your gut feeling about this?

Mayank

12/11/2021, 6:28 PM
Backfill becomes easy if your offline data is time-partitioned, but you can still have multiple segments per day. Also, 200M is not a hard limit; depending on your use case you can go above or below it.

Kishore G

12/11/2021, 7:50 PM
If you set up the realtime-to-offline task, it should automatically partition the data by day boundaries. It will also break the data up into multiple segments per day if there are too many rows on a given day.
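The task Kishore mentions is Pinot's `RealtimeToOfflineSegmentsTask`, configured on the realtime table and executed by minions. A rough sketch of what it could look like in the table config (the values here are illustrative, not prescribed by the thread; the cluster also needs at least one minion and the task scheduler enabled on the controller):

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "bufferTimePeriod": "1d",
        "maxNumRecordsPerSegment": "5000000"
      }
    }
  }
}
```

`bucketTimePeriod: 1d` gives the day-boundary partitioning, and `maxNumRecordsPerSegment` is what causes a busy day to be split into multiple segments.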

Diogo Baeder

12/11/2021, 9:57 PM
(Sorry for the delay) Yeah, it's just that I'm thinking about configuring it with:
```
'realtime.segment.flush.threshold.rows': '0',
'realtime.segment.flush.threshold.time': '24h',
'realtime.segment.flush.desired.size': '200M',
```
because I thought 200M might be a good segment size... Thanks for the help, guys, let me study a bit more about those parts in order to have a better understanding of how I should approach all of that. Cheers!
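For reference, these flush properties live inside the `streamConfigs` block of the REALTIME table config. A hedged sketch, assuming a Kafka stream (the topic and decoder details are placeholders, not from the thread):

```json
"tableIndexConfig": {
  "streamConfigs": {
    "streamType": "kafka",
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.desired.size": "200M"
  }
}
```

Setting `threshold.rows` to `0` is what tells Pinot to ignore the row-count cutoff and instead auto-tune row counts toward the desired segment size, so a segment commits when it reaches roughly 200M or when 24h elapses, whichever comes first.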