# general
k
We generate OFFLINE segments via Hadoop, and sometimes these are updates to existing segments. In that case we want the segment names to match exactly (so that it’s an update). For most segments this is fine, as we partition by month. But there are cases where we also sub-partition by a non-date field. In this situation I don’t see a way to leverage the SegmentNameGenerator interface to give us a deterministic name. If we could key off of the input (CSV) file name then it would be easy, as we’ve got full control over that. Any ideas?
m
For REFRESH tables (which don't have a time column), the segment naming scheme is something like <tableName>_idx. Does that not work?
BTW, there's an issue opened recently about the exact same requirement as yours https://github.com/apache/incubator-pinot/issues/7090
Looking for contributions 😉
k
No, because our segment names will be something like <table name>_<country>_YYYY-MM, but for the US it’s <table name>_us_YYYY-MM_idx, e.g. ads_us_2020-08_0
For cases where we don’t have that final index (sub-partition) it’s easy to ensure exact name matching. But with the US data, we need to sub-partition by a field we use frequently in star tree indexes, so that we get maximum gain.
Thanks for the ref to the issue - yes, this is very similar to what we need.
Added some questions to the issue you referenced.
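For reference, a minimal sketch of the "key off the input file name" idea from the first message, written in plain Python rather than against any Pinot API. The file naming convention (<country>_<YYYY-MM>[_<idx>].csv) and the function name segment_name_from_input are assumptions made for illustration only; nothing here is wired into the Hadoop job or SegmentNameGenerator.

```python
import re
from pathlib import PurePosixPath

# Assumed (hypothetical) input file convention:
#   <country>_<YYYY-MM>.csv          e.g. de_2020-08.csv
#   <country>_<YYYY-MM>_<idx>.csv    e.g. us_2020-08_0.csv (sub-partitioned)
_FILE_PATTERN = re.compile(
    r"^(?P<country>[a-z]{2})_(?P<month>\d{4}-\d{2})(?:_(?P<idx>\d+))?$"
)


def segment_name_from_input(table_name: str, input_path: str) -> str:
    """Derive a deterministic segment name from the input file name.

    The same input file always maps to the same segment name, so
    regenerating a segment from an updated file replaces the old one.
    """
    stem = PurePosixPath(input_path).stem  # file name without ".csv"
    match = _FILE_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected input file name: {input_path!r}")

    parts = [table_name, match["country"], match["month"]]
    if match["idx"] is not None:
        # Normalise the sub-partition index so "00" and "0" give the same name.
        parts.append(str(int(match["idx"])))
    return "_".join(parts)


print(segment_name_from_input("ads", "/data/in/us_2020-08_0.csv"))
print(segment_name_from_input("ads", "/data/in/de_2020-08.csv"))
```

Because the name depends only on the table name and the file name, the sub-partition index stays stable across runs as long as the same data always lands in the same input file.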