Apache Pinot #pinot-docsrus

Ken Krugler

11/21/2021, 4:39 PM

Hi all (and specifically @User) - I was talking about `ID_SET` support in Pinot with co-workers, and tried to point them at the relevant documentation…which I couldn’t find (also `IN_ID_SET`, not sure what else might be relevant). Does this exist somewhere that Google can’t see?

Ken Krugler

11/21/2021, 4:41 PM

I did find https://docs.google.com/document/d/1s6DZ9eTPqH7vaKQlPjKiWb_OBC3hkkEGICIzcd5gozc/edit#, but that’s not end-user documentation.

Mark Needham

11/21/2021, 5:16 PM

on my list of things to do! https://github.com/apache/pinot/issues/7789

👍 1

Mayank

11/21/2021, 10:28 PM

Thanks @User

Mark Needham

11/22/2021, 11:38 AM

@User here we go: https://docs.pinot.apache.org/users/user-guide-query/querying-pinot#filtering-with-idset https://docs.pinot.apache.org/users/user-guide-query/querying-pinot#sub-query-filtering-with-idset I wonder whether we should make a separate page to explain the options for filtering, but for now I've put it on the same page as the other querying examples

🙏 1

Ken Krugler

11/22/2021, 3:46 PM

Hi @User - that’s really good, thanks! Two other items that you might want to cover… 1. I thought that for id set queries other than IN_PARTITIONED_SUBQUERY, you could use an id set from another table, which would be a big win for some use cases. 2. Also I thought @User said something about integers being better (faster?) than strings for id sets.

Mark Needham

11/22/2021, 3:48 PM

1. cool, good idea, lemme do that 2. when I tried to create an example with strings the serialized IdSet value was insanely large compared to the integer equivalent even with very few values!

Ken Krugler

11/22/2021, 4:02 PM

For #2, I think that’s because a Bloom filter is used for non-integers, and the size of that is set by these two parameters:

expectedInsertions (5,000,000 by default): Number of expected insertions for the BloomFilter, must be positive
fpp (0.03 by default): Desired false positive probability for the BloomFilter, must be positive and less than 1.0

So if you set `expectedInsertions` to say 10, I imagine you’d get a much smaller serialized size.

Jackie

11/22/2021, 6:19 PM

@User is correct. For data types other than INT and LONG, we directly start with a bloom filter with 5M expected insertions and 0.03 fpp by default
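To put rough numbers on the defaults above, here is a minimal sketch that assumes the textbook Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2 bits. The thread does not confirm that Pinot sizes its filter exactly this way, and the function name is made up for illustration. It also ignores any serialization overhead.

```python
import math


def bloom_filter_size_bytes(expected_insertions: int, fpp: float) -> int:
    """Estimate Bloom filter size with the textbook formula (an assumption,
    not taken from Pinot's source): m = -n * ln(p) / (ln 2)^2 bits."""
    if expected_insertions <= 0:
        raise ValueError("expected_insertions must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be positive and less than 1.0")
    num_bits = -expected_insertions * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(num_bits / 8)


# Defaults quoted in the thread: 5,000,000 expected insertions, 0.03 fpp.
default_size = bloom_filter_size_bytes(5_000_000, 0.03)
# The much smaller setting suggested in the thread.
small_size = bloom_filter_size_bytes(10, 0.03)

print(f"defaults: about {default_size / 1_000_000:.1f} MB")
print(f"expectedInsertions=10: {small_size} bytes")
```

Under that assumption the filter is sized for the expected number of insertions, not the number of values actually inserted, which would explain a large serialized value even for a handful of strings.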

Jackie

11/22/2021, 6:21 PM

@User Thanks for adding the docs. I think we should also document this behavior and optional arguments so that people can tune the performance accordingly

Jonathan Meyer

11/23/2021, 9:53 AM

Thanks @User !

Mark Needham

11/23/2021, 12:22 PM

@User @User attempt #2 https://docs.pinot.apache.org/users/user-guide-query/filtering-with-idset

Ken Krugler

11/23/2021, 3:07 PM

Hi @User better, thanks. A few comments…
1. Default values for the tuning parameters?
2. Explain that `sizeThresholdInBytes` is used for int & long, and controls when it switches to a Bloom Filter.
3. Note that when a Bloom Filter is used, the filter results are approximate - you can get false positive results (for membership in the set), leading to potentially unexpected results.
4. Cover when the id set can be built from a different table than the one being used for filtering
5. You say “When creating an IdSet for values in non INT/LONG columns, we can configure the expectedInsertions and fpp parameters”, but the example only sets the expectedInsertions. And it’s a bit confusing, in that these parameters would only come into play if the RoaringXXX data structure exceeds the `sizeThresholdInBytes` limit, which has a default of 8MB.
6. You say “The generated IdSet for the first query will be smaller as it will only contain the ids for the partitions served by the server.“. I think it would be clearer if you said “…for the subQuery will be smaller”.

Ken Krugler

11/23/2021, 3:09 PM

@User thanks for adding this functionality! In thinking about using this support for one of our use cases, I wish I could configure it to fail if the Roaringxxx data set exceeds the `sizeThresholdInBytes` limit. That way I could guarantee that the results were accurate, versus potentially getting an unexpected and confusing result.
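The accuracy concern above can be illustrated with a toy Bloom filter. This is a minimal sketch, not Pinot's implementation: the class, the SHA-256 based hashing, and the sample ids are all assumptions chosen to keep the example self-contained. It shows that members are never missed, while some non-members are reported as present.

```python
import hashlib
import math


class ToyBloomFilter:
    """Illustrative Bloom filter; hashing scheme is an assumption."""

    def __init__(self, expected_insertions: int, fpp: float):
        # Textbook sizing for the bit array and the number of hash functions.
        self.num_bits = max(1, math.ceil(
            -expected_insertions * math.log(fpp) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(
            self.num_bits / expected_insertions * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


# Hypothetical ids: 1,000 members, then 10,000 values that were never added.
ids = [f"user-{i}" for i in range(1000)]
bloom = ToyBloomFilter(expected_insertions=len(ids), fpp=0.03)
for value in ids:
    bloom.add(value)

missed = sum(1 for value in ids if not bloom.might_contain(value))
probes = [f"other-{i}" for i in range(10_000)]
false_positives = sum(1 for value in probes if bloom.might_contain(value))

print(f"false negatives: {missed}")
print(f"false positives: {false_positives} of {len(probes)}")
```

In a filter query, each false positive corresponds to a row that passes the id set check even though its id was not in the original set.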

Mark Needham

11/23/2021, 3:12 PM

1. the defaults are described in the signature - I can add them in the explanation too? 2. good idea, let me do that. 3. will do. 4. ditto 5. I think it always uses the bloom filter for non int/long values 6. makes sense
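The selection rule discussed in this thread can be summarised as a small decision function. This is a sketch of the behaviour as described here, not code from Pinot: the function name and the returned labels are hypothetical, and reading "8MB" as 8 * 1024 * 1024 bytes is an assumption.

```python
# Assumption: the "8MB" default mentioned in the thread is read as 8 MiB.
DEFAULT_SIZE_THRESHOLD_BYTES = 8 * 1024 * 1024


def id_set_kind(data_type: str, exact_size_bytes: int,
                size_threshold_bytes: int = DEFAULT_SIZE_THRESHOLD_BYTES) -> str:
    """Return 'exact' or 'bloom' following the rules described in the thread."""
    # Types other than INT and LONG start with a Bloom filter directly.
    if data_type.upper() not in ("INT", "LONG"):
        return "bloom"
    # INT and LONG stay exact until the structure exceeds the threshold.
    if exact_size_bytes > size_threshold_bytes:
        return "bloom"
    return "exact"


print(id_set_kind("STRING", 10))
print(id_set_kind("INT", 1_000))
print(id_set_kind("LONG", 9 * 1024 * 1024))
```

Only the "exact" case guarantees exact membership results; both "bloom" cases can return false positives.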

Ken Krugler

11/23/2021, 3:22 PM

1. The signature I saw was `ID_SET(columnName, 'sizeThresholdInBytes=1000;expectedInsertions=10000;fpp=0.03')`, but I thought `expectedInsertions` default was 5M, and `sizeThresholdInBytes` was 8MB. Oh, I see you edited these, cool. 2. Sorry, missed the “non” bit in the description in my fifth item.
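For anyone generating these queries from code, the option string in the signature above is a semicolon-separated list of `key=value` pairs. A minimal sketch of a helper that builds the call text follows; the helper name is hypothetical, and the form without an options argument is assumed to fall back to the defaults.

```python
def id_set_call(column: str, **options) -> str:
    """Build the text of an ID_SET call (hypothetical helper, not a Pinot API)."""
    if not options:
        # Assumption: omitting the second argument uses the default settings.
        return f"ID_SET({column})"
    option_string = ";".join(f"{key}={value}" for key, value in options.items())
    return f"ID_SET({column}, '{option_string}')"


# Reproduces the signature quoted in the thread.
print(id_set_call("columnName", sizeThresholdInBytes=1000,
                  expectedInsertions=10000, fpp=0.03))
# A small string id set, following the expectedInsertions suggestion above.
print(id_set_call("columnName", expectedInsertions=10))
```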

Mark Needham

11/23/2021, 3:23 PM

yeh you're right

Mark Needham

11/23/2021, 3:23 PM

updated the signature!

Mark Needham

11/23/2021, 3:23 PM

I thought the values in the google doc were the defaults, but they weren't, haha

Jackie

11/23/2021, 6:04 PM

@User We can add another optional argument to throw an exception whenever the size exceeds the threshold, should be straightforward

Ken Krugler

11/23/2021, 6:14 PM

Hi @User - ok, looks good. I’d still love to see something about being able to use ids from a different table, as a way of doing a cross-table filter (except when using `IN_PARTITIONED_SUBQUERY`).

Mark Needham

11/23/2021, 9:19 PM

ok, lemme think of an example for that one

xtrntr

11/24/2021, 8:54 PM

it would be great if `IN_SUBQUERY` could support use cases where your sub query has a group by / having clause: e.g.

... WHERE IN_SUBQUERY(yearID, "SELECT yearID, count(*) from ... GROUP BY yearID HAVING yearID >= 10") ...

xtrntr

11/24/2021, 8:55 PM

thanks for documenting this btw

➕ 1
