Hi team, since upsert table maintains an in-memory...
# general
n
Hi team, since upsert table maintains an in-memory map from the primary key to the record location, is it still recommended to define an inverted index on the table’s primary key? The primary key is unique per record, so is it recommended to define a bloom filter on the primary key instead of an inverted index to save space or is O(n) space per index, where n is the number of records, considered normal?
🍷 1
r
don't create an inverted index on it
bloom filter helps find the segment
you can also find the server quickly if you add this configuration so the broker knows how to route queries https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing
partitioning helps find the server, bloom filter helps find the segment, but you will still need to scan within the segment to find the record, to speed up intrasegment query time, consider sorting on the primary key
n
Example table:
employee
. Primary key:
employeeId
. Partitioning key:
companyId
. I have 8 partitions for
employee
table. This means records of around 1/8 of all companies goes into a single partition. As you mentioned, partitioning helps finds the server. We have also an inverted index on our partitioning key (
companyId
), since each partition contains the records of hundreds of companies but our queries are always restricted to a single company. I’ll add bloom filters on
employeeId
column to help prune segments that don’t contains records related to a specific employeeId or a set of employee Ids. I think there is no value in adding
companyId
to the set of bloom filter columns since it is highly unlikely that a segment does not contains at least a single record from each company. Note: as per docs, bloom filters can be applied only to dictionary-encoded columns. We want to avoid scanning all the content of a segment, can you elaborate on why sorted index is better than inverted index? Please note that each record has a unique primary key and our primary key is a Mongo 12-bytes ObjectId.
Plan to specify multiple replica set per table. instanceSelectorType will be set to “strictReplicaGroup” since we support upsert.
r
with an inverted index you end up with a one element bitmap per employee, with a sorted index you just binary search on the employee
however, given that a 2000 employee company is quite large, and assuming each employee is scoped to a single company, I wouldn't index on employee at all, but sort on company
👍 1
you will always provide the company id for the sake of partition selection anyway, and a sorted index access on company id will select 100s or small 1000s of records to scan for the employeeId
also that scan will be contiguous within the scope of a company, so it's the best case for scanning
If I were you, I would delay adding bloom filter on employee Id until you've sorted on companyId and seen how far that gets you
1
k
So you are looking up on a secondary key (employeeId).. so having bloomfilter on employee I’d is a good idea.
r
it could be worth it, but we've also had reports of a lot of time spent building bloom filters as well as a lot of time in segment pruning, and the tradeoff is against a binary search then a small contiguous scan
I wouldn't like to guess where the crossover point is so I'd try without first and see what the marginal benefit is
n
thanks for your input .. I will start w/ sorted index on company id and consider later a bloom filter on employee ID (to prune segments more effectively if needed)