Yash Agarwal

03/17/2021, 4:29 PM
Hello, is there any performance difference between the following two queries in Pinot?
select distinct city from transactions limit 100000
select city from transactions group by city limit 100000

Ken Krugler

03/17/2021, 4:38 PM
My guess was no, as implementation-wise it’s a similar operation. But just for grins I tried it on a large dataset (1.7b records) and got similar performance. I’d guess that memory usage would also be similar.

Mayank

03/17/2021, 4:39 PM
Hey, distinct and group by use different engines internally, even though semantically they mean the same thing and might end up doing a similar amount of work.

Ken Krugler

03/18/2021, 3:07 AM
Just FYI I wound up getting the same first 10 results from both queries, which is why I thought the underlying implementation was the same, since there’s no ordering for the results. But based on what @Mayank said that’s not the case.

Mayank

03/18/2021, 3:07 AM
Yeah, the operators for group-by and distinct are implemented separately.
@Ken Krugler for the second query, did you run
select city
or
select count(*)
?

Ken Krugler

03/18/2021, 3:10 AM
select city
Though with my data set, it was actually
select advertiser

Mayank

03/18/2021, 3:11 AM
OK, my comments assumed the second query was an aggregation group-by (I didn't read the second query carefully).
I don't recall that we re-write the second query as a distinct internally, but I can check that.
@Ken Krugler @Yash Agarwal I stand corrected. The Calcite parser re-writes the second query as a distinct. In my previous comments, I thought the second query was an aggr-group-by query.

Ken Krugler

03/18/2021, 3:19 AM
Thanks for checking, that explains why the results were so similar 🙂

Mayank

03/18/2021, 3:19 AM
Yep, that is what got me thinking as well.
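
For anyone who wants to see why a group-by with no aggregation functions can be treated as a distinct, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than Pinot, so it only illustrates that the two query shapes return the same set of values. It says nothing about Pinot's operators, its rewrite logic, or performance. The helper name distinct_vs_group_by and the sample cities are made up for illustration.

```python
import sqlite3


def distinct_vs_group_by(cities):
    """Run both query shapes from the thread against an in-memory SQLite table.

    Assumption: SQLite stands in for Pinot here purely to show the semantics.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (city TEXT)")
    conn.executemany(
        "INSERT INTO transactions (city) VALUES (?)",
        [(city,) for city in cities],
    )

    # Query 1 from the thread: explicit DISTINCT.
    distinct_rows = conn.execute(
        "SELECT DISTINCT city FROM transactions LIMIT 100000"
    ).fetchall()

    # Query 2 from the thread: GROUP BY with no aggregation function.
    grouped_rows = conn.execute(
        "SELECT city FROM transactions GROUP BY city LIMIT 100000"
    ).fetchall()

    conn.close()

    # Compare as sets because neither query specifies an ORDER BY.
    return {row[0] for row in distinct_rows}, {row[0] for row in grouped_rows}


distinct, grouped = distinct_vs_group_by(
    ["Pune", "Delhi", "Pune", "Austin", "Delhi"]
)
print(distinct == grouped)
```

The comparison uses sets since, as noted earlier in the thread, neither query orders its results, so matching row order should not be relied on.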