If I need to determine the number of groups from a...
# pinot-dev
k
If I need to determine the number of groups from an aggregation query, where the groups are filtered by aggregation result, are there any recommended approaches? E.g. group by minute, sum page views, and I only care about minutes with > 1000 page views - what’s a good way to determine the number of interesting minutes? Assume there can be many (e.g. > 1M groups) for my specific use case, so I can’t do an order by with some large limit.
m
Are you referring to the
Having
clause? If so, that is supported.
k
I can use
having <filter>
to restrict groups, but is there a way to count the number of groups without returning the groups (and using some arbitrarily huge limit)?
(side note - what page is
having
documented on?
m
Having
follows SQL syntax so might not be explicitly documented. But that brings up a good point, perhaps we should catalogue what is supported and what isn't (from SQL).
I am unsure if there's a better way to do what you want other than
Having
. Perhaps the
Having
implementation can be optimized to do filtering on server side (pre combine) if the query allows for it (for example if a monotonically increasing aggr function has a filter).
k
If the data is segmented by the group key then I would imagine server-side filtering is a possible optimization, otherwise I think you don’t know whether the aggregation results of the gather phase would pass the filter until all results have been combined from all servers. And that means an unbounded amount of memory for the priority map (or whatever is used to collect the results).
m
I was referring to cases like count(*) or sum (with +ve numbers) and filters like xxx > yyy where you can safely do filtering in scatter phase. Agreed though, you can't always do that.