#pinot-docsrus

Ken Krugler

11/21/2021, 4:39 PM
Hi all (and specifically @User) - I was talking about ID_SET support in Pinot with co-workers, and tried to point them at the relevant documentation…which I couldn’t find (also IN_ID_SET, not sure what else might be relevant). Does this exist somewhere that Google can’t see?

Mark Needham

11/21/2021, 5:16 PM
👍 1

Mayank

11/21/2021, 10:28 PM
Thanks @User

Mark Needham

11/22/2021, 11:38 AM
@User here we go: https://docs.pinot.apache.org/users/user-guide-query/querying-pinot#filtering-with-idset https://docs.pinot.apache.org/users/user-guide-query/querying-pinot#sub-query-filtering-with-idset I wonder whether we should make a separate page to explain the options for filtering, but for now I've put it on the same page as the other querying examples
🙏 1

Ken Krugler

11/22/2021, 3:46 PM
Hi @User - that’s really good, thanks! Two other items that you might want to cover… 1. I thought that for id set queries other than IN_PARTITIONED_SUBQUERY, you could use an id set from another table, which would be a big win for some use cases. 2. Also I thought @User said something about integers being better (faster?) than strings for id sets.

Mark Needham

11/22/2021, 3:48 PM
1. cool, good idea, lemme do that 2. when I tried to create an example with strings the serialized IdSet value was insanely large compared to the integer equivalent even with very few values!

Ken Krugler

11/22/2021, 4:02 PM
For #2, I think that’s because a Bloom filter is used for non-integers, and the size of that is set by these two parameters:
expectedInsertions (5,000,000 by default): Number of expected insertions for the BloomFilter, must be positive
fpp (0.03 by default): Desired false positive probability for the BloomFilter, must be positive and less than 1.0
So if you set expectedInsertions to say 10, I imagine you’d get a much smaller serialized size.
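To put rough numbers on that, here is a minimal sketch using the textbook Bloom filter sizing formula, bits = -n * ln(p) / (ln 2)^2. Whether Pinot sizes and serializes its filter exactly this way is an assumption, and the function name `bloom_filter_bits` is made up for illustration.

```python
import math


def bloom_filter_bits(expected_insertions: int, fpp: float) -> int:
    """Textbook optimal bit count for a Bloom filter.

    This is an assumption about how the filter is sized; it is not
    taken from Pinot's source code.
    """
    if expected_insertions <= 0:
        raise ValueError("expected_insertions must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be between 0 and 1 (exclusive)")
    return math.ceil(-expected_insertions * math.log(fpp) / (math.log(2) ** 2))


# Defaults quoted in this thread: 5,000,000 expected insertions, fpp 0.03.
default_bits = bloom_filter_bits(5_000_000, 0.03)
# The much smaller setting suggested above.
small_bits = bloom_filter_bits(10, 0.03)

print(f"defaults: about {default_bits / 8 / 1_000_000:.1f} MB")
print(f"expectedInsertions=10: about {math.ceil(small_bits / 8)} bytes")
```

Under this formula the defaults work out to several megabytes of filter regardless of how few values are actually inserted, while expectedInsertions=10 needs only a handful of bytes, which would fit the size difference Mark saw.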

Jackie

11/22/2021, 6:19 PM
@User is correct. For data types other than INT and LONG, we directly start with a bloom filter with 5M expected insertions and 0.03 fpp by default
@User Thanks for adding the docs. I think we should also document this behavior and optional arguments so that people can tune the performance accordingly

Jonathan Meyer

11/23/2021, 9:53 AM
Thanks @User !

Mark Needham

11/23/2021, 12:22 PM

Ken Krugler

11/23/2021, 3:07 PM
Hi @User better, thanks. A few comments…
1. Default values for the tuning parameters?
2. Explain that sizeThresholdInBytes is used for int & long, and controls when it switches to a Bloom Filter.
3. Note that when a Bloom Filter is used, the filter results are approximate - you can get false positive results (for membership in the set), leading to potentially unexpected results.
4. Cover when the id set can be built from a different table than the one being used for filtering.
5. You say “When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions and fpp parameters”, but the example only sets the expectedInsertions. And it’s a bit confusing, in that these parameters would only come into play if the RoaringXXX data structure exceeds the sizeThresholdInBytes limit, which has a default of 8MB.
6. You say “The generated IdSet for the first query will be smaller as it will only contain the ids for the partitions served by the server.” I think it would be clearer if you said “…for the subQuery will be smaller”.
@User thanks for adding this functionality! In thinking about using this support for one of our use cases, I wish I could configure it to fail if the Roaringxxx data set exceeds the sizeThresholdInBytes limit. That way I could guarantee that the results were accurate, versus potentially getting an unexpected and confusing result.
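The false positive concern can be seen with a toy Bloom filter. The sketch below is illustrative only: `TinyBloomFilter` is a hypothetical class, not Pinot's implementation, and it is deliberately undersized so the effect is obvious.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter; not Pinot's implementation."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(value))


# Deliberately undersized: 64 bits for 1,000 ids.
bloom = TinyBloomFilter(num_bits=64, num_hashes=3)
members = [f"id-{i}" for i in range(1000)]
for member in members:
    bloom.add(member)

non_members = [f"other-{i}" for i in range(1000)]
false_negatives = sum(not bloom.might_contain(m) for m in members)
false_positives = sum(bloom.might_contain(v) for v in non_members)
print(f"false negatives: {false_negatives}, false positives: {false_positives}")
```

Values that were added are always reported as present, but values that were never added can be reported as present too. A filter built on such a set could therefore let through rows whose ids were not in the original set.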

Mark Needham

11/23/2021, 3:12 PM
1. the defaults are described in the signature - I can add them in the explanation too? 2. good idea, let me do that. 3. will do. 4. ditto 5. I think it always uses the bloom filter for non int/long values 6. makes sense

Ken Krugler

11/23/2021, 3:22 PM
1. The signature I saw was ID_SET(columnName, 'sizeThresholdInBytes=1000;expectedInsertions=10000;fpp=0.03' ), but I thought expectedInsertions default was 5M, and sizeThresholdInBytes was 8MB. Oh, I see you edited these, cool. 2. Sorry, missed the “non” bit in the description in my fifth item.
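As a rough illustration of how the options string in that signature breaks down, and which defaults would apply when a key is left out, here is a small sketch. The key names and default values come from this thread; the helper `parse_id_set_options` is hypothetical, 8MB is assumed to mean 8 * 1024 * 1024 bytes, and how Pinot itself parses the string is not shown here.

```python
# Defaults as stated in this thread; 8MB is assumed to mean 8 * 1024 * 1024.
DEFAULTS = {
    "sizeThresholdInBytes": 8 * 1024 * 1024,
    "expectedInsertions": 5_000_000,
    "fpp": 0.03,
}


def parse_id_set_options(options: str) -> dict:
    """Hypothetical helper: split 'key=value;key=value' and apply defaults."""
    parsed = dict(DEFAULTS)
    # Skip empty fragments such as the one left by a trailing semicolon.
    for pair in filter(None, (p.strip() for p in options.split(";"))):
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or key not in DEFAULTS:
            raise ValueError(f"unrecognized option: {pair!r}")
        parsed[key] = float(value) if key == "fpp" else int(value)
    return parsed


# The string from the signature quoted above.
opts = parse_id_set_options(
    "sizeThresholdInBytes=1000;expectedInsertions=10000;fpp=0.03"
)
# Setting only one key leaves the other two at their defaults.
only_insertions = parse_id_set_options("expectedInsertions=10")
print(opts)
print(only_insertions)
```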

Mark Needham

11/23/2021, 3:23 PM
yeh you're right
updated the signature!
I thought the values in the google doc were the defaults, but they weren't, haha

Jackie

11/23/2021, 6:04 PM
@User We can add another optional argument to throw an exception whenever the size exceeds the threshold, should be straightforward

Ken Krugler

11/23/2021, 6:14 PM
Hi @User - ok, looks good. I’d still love to see something about being able to use ids from a different table, as a way of doing a cross-table filter (except when using IN_PARTITIONED_SUBQUERY).

Mark Needham

11/23/2021, 9:19 PM
ok, lemme think of an example for that one

xtrntr

11/24/2021, 8:54 PM
it would be great if IN_SUBQUERY could support use cases where your sub query has a group by / having clause: e.g.
... WHERE IN_SUBQUERY(yearID, "SELECT yearID, count(*) from ... GROUP BY yearID HAVING yearID >= 10") ...
thanks for documenting this btw