# troubleshooting
n
Hello team 👋 We have offline tables in Pinot with `invertedIndexColumns`, `sortedColumn` and `segmentPartition` (with Murmur-based partitions) enabled. We also have `instanceSelectorType` set to "replicaGroup". We currently have the `createInvertedIndexDuringSegmentGeneration` flag set to `false` by default. Is there a recommended approach for setting this flag to `true`, and what is the expected behavior? Will enabling it help minimize index creation after segments are loaded onto servers? Appreciate any help regarding this 🙇‍♀️
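For readers following along, the setup described in this question might look roughly like the sketch below. The table name, column names and partition count are placeholders that do not come from the thread, and required sections such as `segmentsConfig` and `tenants` are left out, so treat it as an illustrative fragment rather than a complete table config.

```python
import json

# Hypothetical table/column names and partition count; only the setting
# names mirror what is mentioned in the question above.
table_config = {
    "tableName": "events",
    "tableType": "OFFLINE",
    "tableIndexConfig": {
        "invertedIndexColumns": ["team_id"],
        "sortedColumn": ["user_id"],
        # The flag under discussion: False means servers build the
        # inverted index when they load the segment.
        "createInvertedIndexDuringSegmentGeneration": False,
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                "user_id": {"functionName": "Murmur", "numPartitions": 8}
            }
        },
    },
    "routing": {"instanceSelectorType": "replicaGroup"},
}

print(json.dumps(table_config, indent=2))
```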
m
It depends on what kind of workload you are running. If it is mission critical with very high throughput and extremely low latency (in ms range), you might avoid creating index during load. But most use cases don’t fall under that category, and should be able to build index while loading. You can also check the resource usage of your servers to see how much headroom you have
n
Hi @Mayank 👋 This is for the same use case that we met about last week, led by @Nikhil (Pinot for user-facing analytics at Slack). Do you think it will be beneficial to turn this flag to true for our use case? Mainly, what would be the benefit of doing this...
m
For your use case, you may start with having this as `false`. The tradeoff is essentially pushing a larger segment (with the index) vs. creating the index on the server (CPU/memory on the server). Also, independently, the primary reason the server has the capability to build indexes during loading is that it provides the flexibility to change indexing as the use case evolves, without needing to re-bootstrap the data.
👍 1
k
FWIW, we did find that pre-creating the indexes (while building segments using the Hadoop workflow) gave us the ability to throw more CPU resources at the task, versus using our Pinot cluster. It mattered when we were bulk loading a lot of data.
n
Thank you @Mayank & @Ken Krugler! @Ken Krugler what was the size of the data when you found `createInvertedIndexDuringSegmentGeneration` to be beneficial? FYI, we use the Pinot Spark batch ingestion job for building & pushing segments to Pinot.
k
I don’t have a hard number in mind, but we were pushing several hundred segments during a batch update and this was causing problems.
n
Thanks @Ken Krugler - could you please clarify what you meant by "this was causing problems"?
Also, can this flag be changed to true later (as needed), after a table is created & loaded?
n
you can change it to true later. but your older segment tar files still won’t have it, and it is not possible to backfill those. also side note, we’re trying to slowly deprecate this config, and have it default to always create during segment creation.
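As a rough illustration of flipping the flag later, the sketch below edits a table config held as a Python dict before it is resubmitted. The helper name `enable_index_at_generation` and the sample config are hypothetical, and how the updated config is pushed back to the cluster is deliberately left out.

```python
import copy


def enable_index_at_generation(table_config: dict, enabled: bool = True) -> dict:
    """Return a copy of table_config with the flag set; the input is not modified."""
    updated = copy.deepcopy(table_config)
    updated.setdefault("tableIndexConfig", {})[
        "createInvertedIndexDuringSegmentGeneration"
    ] = enabled
    return updated


# Hypothetical minimal config fragment, for illustration only.
current = {
    "tableName": "events",
    "tableType": "OFFLINE",
    "tableIndexConfig": {
        "invertedIndexColumns": ["team_id"],
        "createInvertedIndexDuringSegmentGeneration": False,
    },
}

updated = enable_index_at_generation(current)
print(updated["tableIndexConfig"]["createInvertedIndexDuringSegmentGeneration"])
```

As noted in the message above, a change like this only affects segments generated afterwards; segment tar files that were already built do not gain the index.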
n
Based on the plan for this deprecation, is the recommendation to always have the `createInvertedIndexDuringSegmentGeneration` flag set to true? Just trying to evaluate what to expect if we start setting this to true for our new tables, given that we've had it set to false for our existing Pinot offline tables. FYI, we use the Pinot Spark batch ingestion job for building & pushing segments to Pinot.
k
@Neeraja Sridharan - we ran into issues with both CPU load (Pinot becoming unresponsive to queries) and running out of memory, as there’s some amount of JVM ram needed to build the inverted index (I believe it varies with the size of the segment) and the number of parallel segments being processed during the push.
Re handling older segments that don’t have this flag set - when we switched to always creating the index during segment generation, we regenerated & reloaded the older segments. As long as the segment names didn’t change, this was a clean update.
n
@Ken Krugler Thanks for sharing these details!! We'll start with the default `false` value & update to `true` as needed (if we hit issues with Pinot CPU load/running out of memory).
n
atm, if you don't care too much about upload time (which will increase due to the larger segments; as a result, if you have multiple segments for a window you're uploading, you could have an incomplete data view for that window), compared to query performance during upload (which will take a hit if the server has to build indexes), then I'd go with the latter
n
Makes sense! Thanks @Neha Pawar!