Ken Krugler 11/21/2021, 4:39 PM
support in Pinot with co-workers, and tried to point them at the relevant documentation…which I couldn’t find (also
, not sure what else might be relevant). Does this exist somewhere that Google can’t see?
Ken Krugler 11/22/2021, 3:46 PM
Ken Krugler 11/22/2021, 4:02 PM
So if you set
expectedInsertions (5,000,000 by default): Number of expected insertions for the BloomFilter, must be positive
fpp (0.03 by default): Desired false positive probability for the BloomFilter, must be positive and less than 1.0
to say 10, I imagine you’d get a much smaller serialized size.
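To put rough numbers on that intuition, here is a minimal sketch that applies the textbook optimal Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2. The function names are hypothetical, and this is only an estimate of the bit array size, not a measurement of what Pinot actually serializes (headers and implementation details are ignored).

```python
import math


def bloom_filter_bits(expected_insertions: int, fpp: float) -> int:
    """Estimate the bit count using m = -n * ln(p) / (ln 2)^2."""
    if expected_insertions <= 0:
        raise ValueError("expected_insertions must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be positive and less than 1.0")
    bits = -expected_insertions * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits)


def bloom_filter_bytes(expected_insertions: int, fpp: float) -> int:
    """Round the estimated bit count up to whole bytes."""
    return math.ceil(bloom_filter_bits(expected_insertions, fpp) / 8)


# Defaults quoted above: expectedInsertions=5,000,000 and fpp=0.03.
default_size = bloom_filter_bytes(5_000_000, 0.03)
# Same fpp, but expectedInsertions lowered to 10.
tiny_size = bloom_filter_bytes(10, 0.03)

print(f"default estimate: {default_size} bytes")
print(f"expectedInsertions=10 estimate: {tiny_size} bytes")
```

Under this formula the default settings come out at a few megabytes, while expectedInsertions=10 needs only a handful of bytes, which is consistent with expecting a much smaller serialized size.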
Jackie 11/22/2021, 6:19 PM
Jonathan Meyer 11/23/2021, 9:53 AM
Ken Krugler 11/23/2021, 3:07 PM
is used for int & long, and controls when it switches to a Bloom Filter.
3. Note that when a Bloom Filter is used, the filter results are approximate - you can get false positive results (for membership in the set), leading to potentially unexpected results.
4. Cover when the id set can be built from a different table than the one being used for filtering
5. You say “When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions and fpp parameters”, but the example only sets the expectedInsertions. And it’s a bit confusing, in that these parameters would only come into play if the RoaringXXX data structure exceeds the
limit, which has a default of 8MB.
6. You say “The generated IdSet for the first query will be smaller as it will only contain the ids for the partitions served by the server.”. I think it would be clearer if you said “…for the subQuery will be smaller”.
limit. That way I could guarantee that the results were accurate, versus potentially getting an unexpected and confusing result.
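The concern in item 3 (approximate results once a Bloom Filter is in play) can be illustrated with a toy filter. This is a sketch for intuition only: `TinyBloomFilter` is a hypothetical class written for this example and is not Pinot's implementation. It is deliberately undersized so that false positives are easy to observe.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value: int):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: int) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


members = range(0, 1000)
non_members = range(1_000_000, 1_001_000)

# 64 bits is far too small for 1000 insertions, on purpose.
undersized = TinyBloomFilter(num_bits=64, num_hashes=3)
for value in members:
    undersized.add(value)

false_negatives = sum(1 for v in members if not undersized.might_contain(v))
false_positives = sum(1 for v in non_members if undersized.might_contain(v))

print(f"false negatives: {false_negatives}")
print(f"false positives: {false_positives} of {len(non_members)}")
```

Inserted values are always reported as present, but values that were never inserted can be reported as present too. In a filtering query that would mean extra rows slipping through, which is the kind of unexpected result described above.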
Ken Krugler 11/23/2021, 3:22 PM
, but I thought
ID_SET(columnName, 'sizeThresholdInBytes=1000;expectedInsertions=10000;fpp=0.03' )
default was 5M, and
was 8MB. Oh, I see you edited these, cool. 2. Sorry, missed the
bit in the description in my fifth item.
Jackie 11/23/2021, 6:04 PM
Ken Krugler 11/23/2021, 6:14 PM
xtrntr 11/24/2021, 8:54 PM
could support use cases where your sub query has a group by / having clause: e.g.
... WHERE IN_SUBQUERY(yearID, "SELECT yearID, count(*) from ... GROUP BY yearID HAVING yearID >= 10") ...