# troubleshooting
d
Hi guys! Got a question: is it possible to, somehow, have a batch ingestion pipeline for daily ingestions (therefore meaning daily segments being created), but then, on a monthly basis, combine all segments for the previous month and delete the daily segments for it? I'll continue in this thread.
For example: suppose that I was creating one segment per day throughout last March, and then today I wanted to combine all the March daily segments into a single March month segment, then delete the daily March segments. Would this be possible somehow?
The reasoning behind this is: I need to have fresh daily data, but it's very likely that each day will end up with segments that are too small, and the queries I'd run against them would very frequently involve data for full months - therefore hitting too many segments, e.g. 31 segments for March. So I thought that "smashing" them into monthly segments would optimize the segment sizes.
l
something like this? https://docs.pinot.apache.org/operators/operating-pinot/minion-merge-rollup-task we haven’t tried it but we were also thinking about merging segments
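For context, the minion merge behaviour comes from a `MergeRollupTask` entry in the offline table config's `task` section. Below is a minimal sketch of updating a table config to add it, shown as Python against the controller REST API; the controller URL, table name and time periods are illustrative assumptions, and the property names follow the doc linked above rather than anything verified in this thread.

```python
# Sketch: add a MergeRollupTask so the minion merges small daily segments
# into larger ~monthly ones. Controller URL, table name and periods are
# assumed values for illustration only.
import requests

CONTROLLER = "http://localhost:9000"   # assumed controller address
TABLE = "myTable"                      # assumed table name

# Fetch the current offline table config, add the task section, push it back.
table_config = requests.get(f"{CONTROLLER}/tables/{TABLE}").json()["OFFLINE"]

table_config["task"] = {
    "taskTypeConfigsMap": {
        "MergeRollupTask": {
            # "concat" keeps rows as-is; "rollup" would aggregate metrics.
            "30days.mergeType": "concat",
            "30days.bucketTimePeriod": "30d",  # merge into ~monthly buckets
            "30days.bufferTimePeriod": "2d",   # leave the most recent days alone
        }
    }
}

requests.put(f"{CONTROLLER}/tables/{TABLE}", json=table_config).raise_for_status()
```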
m
There’s a minion job for that
Yep, @User beat me to it
d
Ah, nice, I'll take a look, thanks! I was taking a look at RealtimeToOfflineTask, but it didn't seem like what I was looking for...
m
We wrote up a recipe showing how to use it - https://dev.startree.ai/docs/pinot/recipes/merge-small-segments - I think it'll do what you want
d
Nice, thanks man!
Hey guys! If I use the merge rollup approach and have my segments partitioned by a certain column ("region"), does the rolling up of data keep respecting the partitions? Or does it just merge them as well? If they get merged together that would be a problem for me...
k
Just FYI, we have a similar use case. But for us, we’d wind up with > 1000 daily segments by the end of the month, which would be too many. So we just use our Flink workflow to rebuild the monthly segment every day, since it only takes 15 minutes or so, and then do a metadata push that causes those segments to be updated.
d
Ah, interesting, sounds like an option. You use Flink only because of the job scheduling, right?
k
Actually we use Flink to do a bunch of ETL-ish work, and some machine learning, in order to generate the segment data.
d
Got it. Thanks man!
@Ken Krugler sorry to bring this subject up again, but would you mind giving a quick description about how you guys do segment rebuilds, like, in terms of steps and tools/endpoints you use to do that? I may need to implement something similar, so I'd like to have some reference first before I start. Thanks! 🙂
k
We build CSV files using a Flink workflow, then convert these to segments using the Pinot Hadoop map-reduce job. The results are saved to HDFS (which is our deep store), then we run the Pinot admin tool to do a metadata push of the new/updated segments. This is all orchestrated using Airflow.
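To make the step order above concrete, here is a rough sketch of how the segment-build and metadata-push steps could be scripted. Ken's actual setup uses Airflow for orchestration; the job-spec file paths below are hypothetical placeholders, and only the `pinot-admin.sh LaunchDataIngestionJob` command itself comes from Pinot's standard tooling.

```python
# Sketch of the described pipeline steps (illustrative, not the actual setup).
import subprocess

# 1. Flink workflow writes CSV files for the current month to HDFS (not shown).

# 2. Build segments from those CSVs. The job spec (hypothetical path) would point
#    its executionFrameworkSpec at the Hadoop segment-generation runner and write
#    the segments to the HDFS deep store.
subprocess.run(
    ["pinot-admin.sh", "LaunchDataIngestionJob",
     "-jobSpecFile", "/jobs/build-segments.yaml"],
    check=True,
)

# 3. Metadata push: register the new/updated segments with the controller without
#    re-uploading the segment data (jobType SegmentMetadataPush in the spec).
subprocess.run(
    ["pinot-admin.sh", "LaunchDataIngestionJob",
     "-jobSpecFile", "/jobs/metadata-push.yaml"],
    check=True,
)
```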
d
And how about the old segments? You just keep them there?
k
Currently yes (in HDFS). But with 5 years of active data, so far this hasn’t been much of an issue. We do remove the older segments from the table (via the REST api) when we roll into a new month.
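Dropping the expired month's segments via the REST API could look roughly like the sketch below. The table name, segment-naming convention and "expired month" check are assumptions for illustration; only the controller's segment listing and delete endpoints are standard.

```python
# Sketch: drop offline segments for a month that rolled out of the active window.
import requests

CONTROLLER = "http://localhost:9000"   # assumed controller address
TABLE = "myTable"                      # assumed table name

# List the table's segments; assumed response shape: [{"OFFLINE": [...]}, ...]
segment_groups = requests.get(f"{CONTROLLER}/segments/{TABLE}").json()
offline_segments = next(g["OFFLINE"] for g in segment_groups if "OFFLINE" in g)

for name in offline_segments:
    # Hypothetical convention: the month is encoded in the segment name.
    if "2018-03" in name:
        requests.delete(f"{CONTROLLER}/segments/{TABLE}/{name}").raise_for_status()
```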
d
Ah, cool. Doesn't that lead to duplicate data while the old segments still exist though?
k
Segments are partitioned by date, one per country x month (with sub-partitioning for US, since it’s so much bigger than others)
So new segments being added never contain duplicate data. When we update an existing month, then it’s a replace of the segment (because the name is the same)
d
Ah, so you don't use segments that end up containing the same data then? I thought you were ingesting daily data and having them as daily segments, and only later converting to monthly segments...
k
No, we regenerate each current month on a daily basis, since the Flink workflow and segment building is pretty fast (typically < 15 minutes, a bit longer near the end of the month)
So that simplifies things, at the cost of slightly longer build times.
d
Oh! Nice, that's a very interesting approach!
So, just to confirm: you don't even use the batch ingestion process from Pinot, rather you have your own process for generating segments, which you keep pushing to Pinot, right?
k
Correct
We do a lot of manipulation of the source data before it gets into Pinot (in our Flink workflow). E.g. some machine learning, parsing of text, etc.
d
Awesome! Thanks a lot, man, that gives me enough reference and ideas to start working on my workflow too 😊
Ah, very interesting!
Anyway, that was very helpful, and I think I'll end up building my own segment generation flow too. Thanks, and have a great weekend!
@Ken Krugler I just managed to build a workflow that works fine for me too, by using the `ingestFromURI` endpoint and defining fixed segment names. I experimented with changing the data and then re-uploading the segments, and it worked beautifully. With this, I'll do the same as you: keep generating monthly segments and refreshing them every day, or whenever we need to change the data - to remove or update rows, for example. Really nice! 🙂
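For reference, the refresh flow described here could look roughly like the sketch below: re-ingest a month's file through the controller's `ingestFromURI` endpoint with a pinned segment name, so re-running it for the same month replaces the existing segment. The batch-config property names used to fix the segment name, plus the table name and source URI, are assumptions; check the Pinot docs for your version.

```python
# Sketch: refresh a monthly segment by re-ingesting its source file under a
# fixed segment name (illustrative values throughout).
import json
import requests

CONTROLLER = "http://localhost:9000"     # assumed controller address
TABLE = "myTable_OFFLINE"                # assumed table name (with type)

batch_config = {
    "inputFormat": "csv",
    # Assumed properties: pin the generated segment's name so a re-run for the
    # same month overwrites (refreshes) the existing segment.
    "segmentNameGeneratorType": "fixed",
    "segment.name": "myTable_2018-03",   # hypothetical fixed segment name
}

resp = requests.post(
    f"{CONTROLLER}/ingestFromURI",
    params={
        "tableNameWithType": TABLE,
        "batchConfigMapStr": json.dumps(batch_config),
        "sourceURIStr": "hdfs://data/myTable/2018-03.csv",  # hypothetical URI
    },
)
resp.raise_for_status()
```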
k
@Diogo Baeder - glad it’s working for you!