We generate OFFLINE segments via Hadoop and sometimes these Apache Pinot #general


Ken Krugler

07/06/2021, 6:03 PM

We generate OFFLINE segments via Hadoop, and sometimes these are updates to existing segments. In that case we want the segment names to match exactly (so that it’s an update). For most segments this is fine, as we partition by month. But there are cases where we also sub-partition by a non-date field. In this situation I don’t see a way to leverage the SegmentNameGenerator interface to give us a deterministic name. If we could key off of the input (CSV) file name then it would be easy, as we’ve got full control over that. Any ideas?
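
To illustrate the "key off the input file name" idea, here is a minimal Python sketch. It is independent of Pinot's actual SegmentNameGenerator interface and is not how Pinot names segments. The function name, the extension list, and the example paths are hypothetical, and it assumes each input CSV is named after the segment it should produce.

```python
from pathlib import PurePosixPath

# Hypothetical list of input extensions to strip; longest match first.
KNOWN_EXTENSIONS = (".csv.gz", ".csv")


def segment_name_from_input(input_path: str) -> str:
    """Derive a segment name from an input file path (illustrative helper).

    Assumes the file's base name, minus a known extension, is exactly
    the segment name we want, so re-running the job on the same file
    always yields the same name.
    """
    base = PurePosixPath(input_path).name
    for ext in KNOWN_EXTENSIONS:
        if base.endswith(ext):
            return base[: -len(ext)]
    # Unknown extension: fall back to the base name unchanged.
    return base


if __name__ == "__main__":
    print(segment_name_from_input("/data/input/ads_us_2020-08_0.csv"))
```

Because the name depends only on the file name, regenerating a segment from an updated copy of the same file would replace the existing segment rather than add a new one.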

Mayank

07/06/2021, 6:05 PM

For REFRESH tables (which don't have a time column), the segment naming scheme is something like <tableName>_idx. Does that not work?

Mayank

07/06/2021, 6:06 PM

BTW, there's an issue opened recently about the exact same requirement as yours https://github.com/apache/incubator-pinot/issues/7090

Mayank

07/06/2021, 6:06 PM

Looking for contributions 😉

Ken Krugler

07/06/2021, 6:07 PM

No, because our segment names will be something like <table name>_<country>_YYYY-MM but for the US it’s <table name>_us_YYYY-MM_idx, e.g. ads_us_2020-08_0
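
The naming scheme described here can be sketched as a small Python function. This only illustrates the desired format and is not Pinot code; the function name and parameters are hypothetical, and the optional sub-partition index is assumed to be a stable integer assigned by the job.

```python
from typing import Optional


def build_segment_name(
    table: str,
    country: str,
    year: int,
    month: int,
    sub_partition: Optional[int] = None,
) -> str:
    """Build <table name>_<country>_YYYY-MM, with an optional _idx suffix."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    name = f"{table}_{country.lower()}_{year:04d}-{month:02d}"
    # Only sub-partitioned data (the US in this thread) gets the index.
    if sub_partition is not None:
        name += f"_{sub_partition}"
    return name


if __name__ == "__main__":
    print(build_segment_name("ads", "us", 2020, 8, sub_partition=0))
    print(build_segment_name("ads", "fr", 2020, 8))
```

The same inputs always produce the same name, which is the property needed for a regenerated segment to overwrite the existing one.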

Ken Krugler

07/06/2021, 6:08 PM

For cases where we don’t have that final index (sub-partition) it’s easy to ensure exact name matching. But with the US data, we need to sub-partition by a field we use frequently in star tree indexes, so that we get maximum gain.

Ken Krugler

07/06/2021, 6:09 PM

Thanks for the ref to the issue - yes, this is very similar to what we need.

Ken Krugler

07/06/2021, 6:14 PM

Added some questions to the issue you referenced.
