Ken Krugler
11/21/2021, 4:39 PM
ID_SET support in Pinot with co-workers, and tried to point them at the relevant documentation…which I couldn’t find (also IN_ID_SET, not sure what else might be relevant). Does this exist somewhere that Google can’t see?
Mark Needham
Mayank
Mark Needham
Ken Krugler
11/22/2021, 3:46 PM
Mark Needham
Ken Krugler
11/22/2021, 4:02 PM
expectedInsertions (5,000,000 by default): Number of expected insertions for the BloomFilter, must be positive
fpp (0.03 by default): Desired false positive probability for the BloomFilter, must be positive and less than 1.0
So if you set expectedInsertions to say 10, I imagine you’d get a much smaller serialized size.
Jackie
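To put rough numbers on the expectedInsertions point above, here is a minimal sketch using the textbook Bloom filter sizing formula, bits = -n * ln(p) / (ln 2)^2. It is an assumption that Pinot's IdSet Bloom filter is sized this way, and the real serialized size will include overhead this estimate ignores. The function names are made up for illustration.

```python
import math


def bloom_filter_bits(expected_insertions: int, fpp: float) -> int:
    """Estimate the bit count for a Bloom filter (textbook formula, an assumption here)."""
    if expected_insertions <= 0:
        raise ValueError("expected_insertions must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be positive and less than 1.0")
    # Optimal number of bits: -n * ln(p) / (ln 2)^2, rounded up.
    return math.ceil(-expected_insertions * math.log(fpp) / (math.log(2) ** 2))


def bloom_filter_bytes(expected_insertions: int, fpp: float) -> int:
    """Same estimate, rounded up to whole bytes."""
    return math.ceil(bloom_filter_bits(expected_insertions, fpp) / 8)


if __name__ == "__main__":
    # Compare the default expectedInsertions from the thread with a tiny value.
    for n in (5_000_000, 10):
        print(n, bloom_filter_bytes(n, 0.03))
```

Under this estimate the defaults (5,000,000 and 0.03) need several megabytes, while expectedInsertions=10 needs only a handful of bytes.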
11/22/2021, 6:19 PM
Jonathan Meyer
11/23/2021, 9:53 AM
Mark Needham
Ken Krugler
11/23/2021, 3:07 PM
sizeThresholdInBytes is used for int & long, and controls when it switches to a Bloom Filter.
3. Note that when a Bloom Filter is used, the filter results are approximate - you can get false positive results (for membership in the set), leading to potentially unexpected results.
4. Cover when the id set can be built from a different table than the one being used for filtering
5. You say “When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions and fpp parameters”, but the example only sets the expectedInsertions. And it’s a bit confusing, in that these parameters would only come into play if the RoaringXXX data structure exceeds the sizeThresholdInBytes limit, which has a default of 8MB.
6. You say “The generated IdSet for the first query will be smaller as it will only contain the ids for the partitions served by the server.” I think it would be clearer if you said “…for the subQuery will be smaller”.
sizeThresholdInBytes limit. That way I could guarantee that the results were accurate, versus potentially getting an unexpected and confusing result.
Mark Needham
Ken Krugler
11/23/2021, 3:22 PM
ID_SET(columnName, 'sizeThresholdInBytes=1000;expectedInsertions=10000;fpp=0.03'), but I thought the expectedInsertions default was 5M, and sizeThresholdInBytes was 8MB. Oh, I see you edited these, cool.
2. Sorry, missed the “non” bit in the description in my fifth item.
Mark Needham
Jackie
11/23/2021, 6:04 PM
Ken Krugler
11/23/2021, 6:14 PM
IN__PARTITIONED__SUBQUERY).
Mark Needham
xtrntr
11/24/2021, 8:54 PM
IN_SUBQUERY could support use cases where your sub query has a group by / having clause: e.g.
... WHERE IN_SUBQUERY(yearID, "SELECT yearID, count(*) from ... GROUP BY yearID HAVING yearID >= 10") ...
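
Since the inner query is passed to IN_SUBQUERY as a string literal, quoting is easy to get wrong once the inner query carries its own quoted values or an ID_SET parameter string. A minimal sketch of building such a query, assuming the documented form is an inner SELECT ID_SET(...) wrapped in single quotes and compared with = 1 (worth checking against the Pinot docs). The table baseballStats, the teamID filter, and all helper names are hypothetical. It says nothing about whether GROUP BY / HAVING is accepted inside the subquery.

```python
def sql_string_literal(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def id_set_call(column: str, **params) -> str:
    """Build ID_SET(column) or ID_SET(column, 'k1=v1;k2=v2'), as in the thread."""
    if not params:
        return f"ID_SET({column})"
    # Parameters are joined as semicolon-separated key=value pairs.
    param_string = ";".join(f"{key}={value}" for key, value in params.items())
    return f"ID_SET({column}, {sql_string_literal(param_string)})"


def in_subquery_filter(column: str, inner_query: str) -> str:
    """Build the filter; the trailing '= 1' is an assumption about Pinot's syntax."""
    return f"IN_SUBQUERY({column}, {sql_string_literal(inner_query)}) = 1"


if __name__ == "__main__":
    # Hypothetical table and filter values, for illustration only.
    inner = (
        f"SELECT {id_set_call('yearID', expectedInsertions=10)} "
        "FROM baseballStats WHERE teamID = 'WS1'"
    )
    print(
        "SELECT yearID, count(*) FROM baseballStats "
        f"WHERE {in_subquery_filter('yearID', inner)} GROUP BY yearID"
    )
```

Every single quote inside the inner query ends up doubled, so the parameter string and the teamID value survive being nested inside the outer literal.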