If we want to get the total number of groups for a...
# general
k
If we want to get the total number of groups for a
group by
, I assume currently we have to do a separate
distinctcount
or
distinctcounthll
, right? But if the group by uses multiple columns, what’s the best approach to getting this total group count?
We can do
select distinctcount(concat(col1, col2, '|'))
and that works, but isn’t very fast.
j
Is the total number of groups very large? If not, you can set a relatively high
LIMIT
for the
GROUP BY
query or
DISTINCT
query to get all the groups
k
It can be large (e.g. > 16M for a test query I just tried)
j
Do you need accurate result or approximate is fine?
k
approximate is OK
j
Currently
distinctCount
family only support single column, we can probably extend that to support multiple columns
distinctcounthll(concat(col1, col2, '|'))
should be faster
k
Yes, that’s what I’ve been using. But I would expect built-in support for multiple columns to be faster still, yes?
And there’s no standard SQL support for returning the total number of groups from the original aggregation query, right?
j
In SQL, I think you still need to use
count(distinct col1, col2)
to get the distinct count
Which is mostly equivalent to
DISTINCT(col1, col2)
with high limit in Pinot
k
distinctcounthll(concat(col1, col2, '|'))
will suffer from Concat scalar udf
May be better to enhance distinctCountHLL to take multiple columns
k
Yes, that would be more performant. Should I file an issue?
k
Yes
And you can give it a shot, it’s not hard
k
I filed https://github.com/apache/incubator-pinot/issues/6676, hoping to get time to work on it 🙂