Apache Pinot #troubleshooting

Neeraja Sridharan

09/26/2022, 7:35 PM

Hello team 👋 We have `offline tables in Pinot` with `invertedIndexColumns`, `sortedColumn` and `segmentPartition` (with Murmur-based partitions) enabled. We also have `instanceSelectorType` set to "replicaGroup". We've currently set the `createInvertedIndexDuringSegmentGeneration` flag to `false` by default. Is there a recommended approach for setting this flag to `true`, and what is the expected behavior? Will it be beneficial to enable it to minimize index creation after segments are loaded onto servers? Appreciate any help regarding this 🙇‍♀️
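For readers following along, the setup described above might look roughly like the sketch below. The config keys are the ones named in the question plus the usual Pinot table config nesting; the table name, column names, and partition count are hypothetical placeholders, not values from this thread.

```python
import copy
import json

# Hypothetical table/column names and partition count; only the config keys
# reflect what the thread describes.
OFFLINE_TABLE_CONFIG = {
    "tableName": "userEvents",
    "tableType": "OFFLINE",
    "tableIndexConfig": {
        "invertedIndexColumns": ["eventType"],
        "sortedColumn": ["userId"],
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                "userId": {"functionName": "Murmur", "numPartitions": 8}
            }
        },
        # The flag under discussion; False means servers build the index on load.
        "createInvertedIndexDuringSegmentGeneration": False,
    },
    "routing": {"instanceSelectorType": "replicaGroup"},
}


def with_index_at_generation(config, enabled):
    """Return a copy of the config with the flag set; the input is not modified."""
    updated = copy.deepcopy(config)
    updated["tableIndexConfig"]["createInvertedIndexDuringSegmentGeneration"] = enabled
    return updated


if __name__ == "__main__":
    # Print the config as it would look with the flag turned on.
    print(json.dumps(with_index_at_generation(OFFLINE_TABLE_CONFIG, True), indent=2))
```

The helper only edits a local dictionary; how the updated config is submitted to the cluster is left out here.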

Mayank

09/26/2022, 8:19 PM

It depends on what kind of workload you are running. If it is mission critical with very high throughput and extremely low latency (in ms range), you might avoid creating index during load. But most use cases don’t fall under that category, and should be able to build index while loading. You can also check the resource usage of your servers to see how much headroom you have

Neeraja Sridharan

09/26/2022, 8:27 PM

Hi @Mayank 👋 This is for the same use case that we met about last week, led by @Nikhil (Pinot for user-facing analytics at Slack). Do you think it will be beneficial to turn this flag to true for our use case? Mainly, what would be the benefit of doing this...

Mayank

09/26/2022, 10:08 PM

For your use case, you may start with having this as `false`. The tradeoff is essentially pushing a larger segment (with index) vs. creating the index on the server (CPU/memory on the server). Also, independently, the primary reason the server has the capability to build indexes during loading is that it provides the flexibility to change indexing as the use case evolves, without needing to re-bootstrap the data.
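To make the "work that has to happen somewhere" concrete, here is a toy sketch of what building an inverted index involves: one pass over a column, collecting the row ids for each distinct value. This is only an illustration of the idea, not Pinot's implementation (Pinot's inverted index is bitmap-based and more compact); the function names and sample data are made up.

```python
from collections import defaultdict


def build_inverted_index(column_values):
    """Map each distinct value to the ascending list of row (doc) ids holding it."""
    index = defaultdict(list)
    for doc_id, value in enumerate(column_values):
        # Doc ids are visited in order, so each posting list stays sorted.
        index[value].append(doc_id)
    return dict(index)


def lookup(index, value):
    """Return the doc ids matching an equality filter, or an empty list."""
    return index.get(value, [])


if __name__ == "__main__":
    column = ["click", "view", "click", "purchase"]
    index = build_inverted_index(column)
    print(lookup(index, "click"))
```

Whether this pass runs in the segment generation job or on the server at load time is exactly the choice the flag controls.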

Ken Krugler

09/26/2022, 11:09 PM

FWIW, we did find that pre-creating the indexes (while building segments using the Hadoop workflow) gave us the ability to throw more CPU resources at the task, versus using our Pinot cluster. It mattered when we were bulk loading a lot of data.

Neeraja Sridharan

09/27/2022, 1:43 AM

Thank you @Mayank & @Ken Krugler! @Ken Krugler what was the size of data when you found `createInvertedIndexDuringSegmentGeneration` to be beneficial? FYI, we use the Spark Pinot batch ingestion job for building & pushing segments to Pinot.

Ken Krugler

09/27/2022, 2:53 PM

I don’t have a hard number in mind, but we were pushing several hundred segments during a batch update and this was causing problems.

Neeraja Sridharan

09/27/2022, 3:40 PM

Thanks @Ken Krugler - could you please clarify what you meant by "this was causing problems"?

Neeraja Sridharan

09/27/2022, 3:45 PM

Also, can this flag be changed to true later (as needed), after a table is created & loaded?

Neha Pawar

09/27/2022, 3:46 PM

you can change it to true later. but your older segment tar files still won’t have it, and it is not possible to backfill those. also side note, we’re trying to slowly deprecate this config, and have it default to always create during segment creation.

Neeraja Sridharan

09/27/2022, 4:08 PM

Based on the plan for this deprecation, is the recommendation to always have the `createInvertedIndexDuringSegmentGeneration` flag set to true? Just trying to evaluate what to expect if we start setting this to true for our new tables, given that we've had it set to false for our existing Pinot offline tables. FYI, we use the Pinot Spark batch ingestion job for building & pushing segments to Pinot.

Ken Krugler

09/27/2022, 5:17 PM

@Neeraja Sridharan - we ran into issues with both CPU load (Pinot becoming unresponsive to queries) and running out of memory, as some amount of JVM RAM is needed to build the inverted index (I believe it varies with the size of the segment and with the number of segments being processed in parallel during the push).

Ken Krugler

09/27/2022, 5:18 PM

Re handling older segments that don’t have this flag set - when we switched to always creating the index during segment generation, we regenerated & reloaded the older segments. As long as the segment names didn’t change, this was a clean update.
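A small pre-flight check along these lines can confirm that regenerated segments will replace the old ones rather than sit alongside them. This is a sketch under assumptions: the segment names are hypothetical, and fetching the existing names from the cluster is not shown.

```python
def compare_segment_names(existing, regenerated):
    """Classify segment names by whether the regenerated set lines up with the old one."""
    existing_set, regenerated_set = set(existing), set(regenerated)
    return {
        # Same name in both sets: the new segment overwrites the old one.
        "replaced": sorted(existing_set & regenerated_set),
        # Old segments with no regenerated counterpart (would stay without the index).
        "missing": sorted(existing_set - regenerated_set),
        # Regenerated segments under a new name (would be added next to the old data).
        "new": sorted(regenerated_set - existing_set),
    }


if __name__ == "__main__":
    # Hypothetical segment names for illustration only.
    old = ["events_2022-09-01_0", "events_2022-09-01_1", "events_2022-09-02_0"]
    new = ["events_2022-09-01_0", "events_2022-09-01_1", "events_2022-09-02_00"]
    print(compare_segment_names(old, new))
```

An empty "missing" and "new" list is the case described above, where the reload is a clean update.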

Neeraja Sridharan

09/27/2022, 7:59 PM

@Ken Krugler Thanks for sharing these details!! We'll start with the default `false` value & update to `true` as needed (if we hit issues with Pinot CPU load/running out of memory).

Neha Pawar

09/27/2022, 8:15 PM

atm, if you don’t care too much about upload time (which will increase due to the larger segment; as a result, if you have multiple segments for a window you’re uploading, you could have an incomplete data view for that window) compared to query performance during upload (which will take a hit if the server has to build indexes), then I’d go with the latter (`true`).

Neeraja Sridharan

09/27/2022, 8:36 PM

Makes sense! Thanks @Neha Pawar!
