# troubleshooting
d
Hi guys! Got a question: is it possible to, somehow, have a batch ingestion pipeline for daily ingestions (therefore meaning daily segments being created), but then, on a monthly basis, combine all segments for the previous month and delete the daily segments for it? I'll continue in this thread.
For example: suppose that I was creating one segment per day throughout last March, and then today I wanted to combine all the March daily segments into a single March month segment, then delete the daily March segments. Would this be possible somehow?
The reasoning behind this is: I need to have fresh daily data, but it's very likely that each day will end up with segments that are too small, and the queries I'd run against them would very frequently involve data for full months - therefore hitting too many segments, e.g. 31 segments for March. So I thought that "smashing" them into monthly segments would optimize the segment sizes.
l
something like this? https://docs.pinot.apache.org/operators/operating-pinot/minion-merge-rollup-task we haven’t tried it but we were also thinking about merging segments
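For context, the minion merge behaviour comes from a `MergeRollupTask` entry in the offline table config's `task` section. Below is a minimal sketch of updating a table config to add it, shown as Python against the controller REST API; the controller URL, table name and time periods are illustrative assumptions, and the property names follow the doc linked above rather than anything verified in this thread.

```python
# Sketch: add a MergeRollupTask so the minion merges small daily segments
# into larger ~monthly ones. Controller URL, table name and periods are
# assumed values for illustration only.
import requests

CONTROLLER = "http://localhost:9000"   # assumed controller address
TABLE = "myTable"                      # assumed table name

# Fetch the current offline table config, add the task section, push it back.
table_config = requests.get(f"{CONTROLLER}/tables/{TABLE}").json()["OFFLINE"]

table_config["task"] = {
    "taskTypeConfigsMap": {
        "MergeRollupTask": {
            # "concat" keeps rows as-is; "rollup" would aggregate metrics.
            "30days.mergeType": "concat",
            "30days.bucketTimePeriod": "30d",  # merge into ~monthly buckets
            "30days.bufferTimePeriod": "2d",   # leave the most recent days alone
        }
    }
}

requests.put(f"{CONTROLLER}/tables/{TABLE}", json=table_config).raise_for_status()
```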
m
There’s a minion job for that
Yep, @User beat me to it
d
Ah, nice, I'll take a look, thanks! I was taking a look at RealtimeToOfflineTask, but it didn't seem like what I was looking for...
m
We wrote up a recipe showing how to use it - https://dev.startree.ai/docs/pinot/recipes/merge-small-segments - I think it'll do what you want
d
Nice, thanks man!
Hey guys! If I use the merge rollup approach and have my segments partitioned by a certain column ("region"), does the rolling up of data keep respecting the partitions? Or does it just merge them as well? If they get merged together that would be a problem for me...
k
Just FYI, we have a similar use case. But for us, we’d wind up with > 1000 daily segments by the end of the month, which would be too many. So we just use our Flink workflow to rebuild the monthly segment every day, since it only takes 15 minutes or so, and then do a metadata push that causes those segments to be updated.
d
Ah, interesting, sounds like an option. You use Flink only because of the job scheduling, right?
k
Actually we use Flink to do a bunch of ETL-ish work, and some machine learning, in order to generate the segment data.
d
Got it. Thanks man!
@Ken Krugler sorry to bring this subject up again, but would you mind giving a quick description about how you guys do segment rebuilds, like, in terms of steps and tools/endpoints you use to do that? I may need to implement something similar, so I'd like to have some reference first before I start. Thanks! 🙂
k
We build CSV files using a Flink workflow, then convert these to segments using the Pinot Hadoop map-reduce job. The results are saved to HDFS (which is our deep store), then we run the Pinot admin tool to do a metadata push of the new/updated segments. This is all orchestrated using Airflow.
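To make the step order above concrete, here is a rough sketch of how the segment-build and metadata-push steps could be scripted. Ken's actual setup uses Airflow for orchestration; the job-spec file paths below are hypothetical placeholders, and only the `pinot-admin.sh LaunchDataIngestionJob` command itself comes from Pinot's standard tooling.

```python
# Sketch of the described pipeline steps (illustrative, not the actual setup).
import subprocess

# 1. Flink workflow writes CSV files for the current month to HDFS (not shown).

# 2. Build segments from those CSVs. The job spec (hypothetical path) would point
#    its executionFrameworkSpec at the Hadoop segment-generation runner and write
#    the segments to the HDFS deep store.
subprocess.run(
    ["pinot-admin.sh", "LaunchDataIngestionJob",
     "-jobSpecFile", "/jobs/build-segments.yaml"],
    check=True,
)

# 3. Metadata push: register the new/updated segments with the controller without
#    re-uploading the segment data (jobType SegmentMetadataPush in the spec).
subprocess.run(
    ["pinot-admin.sh", "LaunchDataIngestionJob",
     "-jobSpecFile", "/jobs/metadata-push.yaml"],
    check=True,
)
```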
d
And how about the old segments? You just keep them there?
k
Currently yes (in HDFS). But with 5 years of active data, so far this hasn’t been much of an issue. We do remove the older segments from the table (via the REST api) when we roll into a new month.
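Dropping the expired month's segments via the REST API could look roughly like the sketch below. The table name, segment-naming convention and "expired month" check are assumptions for illustration; only the controller's segment listing and delete endpoints are standard.

```python
# Sketch: drop offline segments for a month that rolled out of the active window.
import requests

CONTROLLER = "http://localhost:9000"   # assumed controller address
TABLE = "myTable"                      # assumed table name

# List the table's segments; assumed response shape: [{"OFFLINE": [...]}, ...]
segment_groups = requests.get(f"{CONTROLLER}/segments/{TABLE}").json()
offline_segments = next(g["OFFLINE"] for g in segment_groups if "OFFLINE" in g)

for name in offline_segments:
    # Hypothetical convention: the month is encoded in the segment name.
    if "2018-03" in name:
        requests.delete(f"{CONTROLLER}/segments/{TABLE}/{name}").raise_for_status()
```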
d
Ah, cool. Doesn't that lead to duplicate data while the old segments still exist though?
k
Segments are partitioned by date, one per country x month (with sub-partitioning for US, since it’s so much bigger than others)
So new segments being added never contain duplicate data. When we update an existing month, then it’s a replace of the segment (because the name is the same)
d
Ah, so you don't use segments that end up containing the same data then? I thought you were ingesting daily data and having them as daily segments, and only later converting to monthly segments...
k
No, we regenerate each current month on a daily basis, since the Flink workflow and segment building is pretty fast (typically < 15 minutes, a bit longer near the end of the month)
So that simplifies things, at the cost of slightly longer build times.
d
Oh! Nice, that's a very interesting approach!
So, just to confirm: you don't even use the batch ingestion process from Pinot, rather you have your own process for generating segments, which you keep pushing to Pinot, right?
k
Correct
We do a lot of manipulation of the source data before it gets into Pinot (in our Flink workflow). E.g. some machine learning, parsing of text, etc.
d
Awesome! Thanks a lot, man, that gives me enough reference and ideas to start working on my workflow too 😊
Ah, very interesting!
Anyway, that was very helpful, and I think I'll end up building my own segment generation flow too. Thanks, and have a great weekend!
@Ken Krugler I just managed to build a workflow that works fine for me too, by using the `ingestFromURI` endpoint and defining fixed segment names. I experimented with changing the data and then re-uploading the segments, and it worked beautifully. With this, I'll do the same as you: keep generating monthly segments and refreshing them every day, or whenever we need to change the data - to remove or update rows, for example. Really nice! 🙂
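For reference, the refresh flow described here could look roughly like the sketch below: re-ingest a month's file through the controller's `ingestFromURI` endpoint with a pinned segment name, so re-running it for the same month replaces the existing segment. The batch-config property names used to fix the segment name, plus the table name and source URI, are assumptions; check the Pinot docs for your version.

```python
# Sketch: refresh a monthly segment by re-ingesting its source file under a
# fixed segment name (illustrative values throughout).
import json
import requests

CONTROLLER = "http://localhost:9000"     # assumed controller address
TABLE = "myTable_OFFLINE"                # assumed table name (with type)

batch_config = {
    "inputFormat": "csv",
    # Assumed properties: pin the generated segment's name so a re-run for the
    # same month overwrites (refreshes) the existing segment.
    "segmentNameGeneratorType": "fixed",
    "segment.name": "myTable_2018-03",   # hypothetical fixed segment name
}

resp = requests.post(
    f"{CONTROLLER}/ingestFromURI",
    params={
        "tableNameWithType": TABLE,
        "batchConfigMapStr": json.dumps(batch_config),
        "sourceURIStr": "hdfs://data/myTable/2018-03.csv",  # hypothetical URI
    },
)
resp.raise_for_status()
```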
k
@Diogo Baeder - glad it’s working for you!