Hi team since upsert table maintains an in memory map from t Apache Pinot #general

Hi team, since upsert table maintains an in-memory...

Nizar Hejazi

04/22/2022, 9:29 AM

Hi team, since upsert table maintains an in-memory map from the primary key to the record location, is it still recommended to define an inverted index on the table’s primary key? The primary key is unique per record, so is it recommended to define a bloom filter on the primary key instead of an inverted index to save space or is O(n) space per index, where n is the number of records, considered normal?

🍷 1

Richard Startin

04/22/2022, 9:36 AM

don't create an inverted index on it

Richard Startin

04/22/2022, 9:36 AM

bloom filter helps find the segment

Richard Startin

04/22/2022, 9:38 AM

you can also find the server quickly if you add this configuration so the broker knows how to route queries https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing

Richard Startin

04/22/2022, 9:39 AM

partitioning helps find the server, bloom filter helps find the segment, but you will still need to scan within the segment to find the record, to speed up intrasegment query time, consider sorting on the primary key

Nizar Hejazi

04/22/2022, 9:53 AM

Example table:

employee

. Primary key:

employeeId

. Partitioning key:

companyId

. I have 8 partitions for

employee

table. This means records of around 1/8 of all companies goes into a single partition. As you mentioned, partitioning helps finds the server. We have also an inverted index on our partitioning key (

companyId

), since each partition contains the records of hundreds of companies but our queries are always restricted to a single company. I’ll add bloom filters on

employeeId

column to help prune segments that don’t contains records related to a specific employeeId or a set of employee Ids. I think there is no value in adding

companyId

to the set of bloom filter columns since it is highly unlikely that a segment does not contains at least a single record from each company. Note: as per docs, bloom filters can be applied only to dictionary-encoded columns. We want to avoid scanning all the content of a segment, can you elaborate on why sorted index is better than inverted index? Please note that each record has a unique primary key and our primary key is a Mongo 12-bytes ObjectId.

Nizar Hejazi

04/22/2022, 9:55 AM

Plan to specify multiple replica set per table. instanceSelectorType will be set to “strictReplicaGroup” since we support upsert.

Richard Startin

04/22/2022, 12:44 PM

with an inverted index you end up with a one element bitmap per employee, with a sorted index you just binary search on the employee

Richard Startin

04/22/2022, 12:45 PM

however, given that a 2000 employee company is quite large, and assuming each employee is scoped to a single company, I wouldn't index on employee at all, but sort on company

👍 1

Richard Startin

04/22/2022, 12:46 PM

you will always provide the company id for the sake of partition selection anyway, and a sorted index access on company id will select 100s or small 1000s of records to scan for the employeeId

Richard Startin

04/22/2022, 12:48 PM

also that scan will be contiguous within the scope of a company, so it's the best case for scanning

Richard Startin

04/22/2022, 12:49 PM

If I were you, I would delay adding bloom filter on employee Id until you've sorted on companyId and seen how far that gets you

➕ 1

Kishore G

04/22/2022, 1:46 PM

So you are looking up on a secondary key (employeeId).. so having bloomfilter on employee I’d is a good idea.

Richard Startin

04/22/2022, 2:56 PM

it could be worth it, but we've also had reports of a lot of time spent building bloom filters as well as a lot of time in segment pruning, and the tradeoff is against a binary search then a small contiguous scan

Richard Startin

04/22/2022, 2:57 PM

I wouldn't like to guess where the crossover point is so I'd try without first and see what the marginal benefit is

Nizar Hejazi

04/22/2022, 5:10 PM

thanks for your input .. I will start w/ sorted index on company id and consider later a bloom filter on employee ID (to prune segments more effectively if needed)

3 Views

Open in Slack

Previous Next