Jonathan Meyer

06/15/2021, 6:45 PM
Hello ^^ Not really an issue, just checking whether this is "normal behavior".
select DISTINCT(kpi) from kpis
takes ~6ms (with 100M docs and numDocsScanned: 100000); this query returns only 45 strings.
But doing
select DISTINCT(kpi) from kpis ORDER BY kpi
takes >300ms (about 50 times slower) and scans every document (numDocsScanned: 101250000).
I guess the ORDER BY disables some optimization, but from the outside this is pretty surprising behavior (sorting 45 strings "should not take this long" is what I mean).
Anyway, not here to complain, just wanted to point it out in case it is considered worth investigating.

Mayank

06/15/2021, 6:50 PM
Is the cardinality of kpi only 45 and both return just 45 values?
If so, it does seem like some room for optimization. Mind filing an issue?

Jonathan Meyer

06/15/2021, 6:50 PM
Yes, only 45 different 'kpis' exist in all 100M docs
@Mayank Okay! I'll file this one and another one sometime this week 😄

Mayank

06/15/2021, 6:51 PM
How many segments do you have?

Jonathan Meyer

06/15/2021, 6:51 PM
90

Mayank

06/15/2021, 6:51 PM
numDocsScanned: 100000 seems to suggest early bailout

Jonathan Meyer

06/15/2021, 6:51 PM
select COUNT(DISTINCT(kpi)) from kpis
-> 45
select COUNT(DISTINCT(kpi)) from kpis ORDER BY kpi
-> 45
Ah, got something interesting
With COUNT, both queries are equally fast

Mayank

06/15/2021, 6:52 PM
A count without a predicate is answered directly from metadata (unless it is a hybrid table, where an implicit time predicate is added to the offline/realtime queries).
Are you seeing the second query being consistently slower?
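Editor's sketch of the metadata shortcut described above, assuming each segment records its total document count. `SegmentMetadata` and `countStar` are hypothetical names for illustration, not Pinot classes, and the even split of documents across the 90 segments is assumed only to make the example concrete. With an implicit time predicate, as on a hybrid table, this shortcut would not apply because the filter has to be evaluated.

```java
import java.util.ArrayList;
import java.util.List;

final class MetadataCountSketch {
    // Hypothetical per-segment metadata; only the field needed here.
    static final class SegmentMetadata {
        final String name;
        final long totalDocs;

        SegmentMetadata(String name, long totalDocs) {
            this.name = name;
            this.totalDocs = totalDocs;
        }
    }

    // An unfiltered count only needs the per-segment totals: no document is read.
    static long countStar(List<SegmentMetadata> segments) {
        long total = 0;
        for (SegmentMetadata segment : segments) {
            total += segment.totalDocs;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumption for illustration: 90 equally sized segments.
        List<SegmentMetadata> segments = new ArrayList<>();
        for (int i = 0; i < 90; i++) {
            segments.add(new SegmentMetadata("segment_" + i, 1_125_000L));
        }
        System.out.println(countStar(segments)); // prints 101250000
    }
}
```

The cost is proportional to the number of segments (90 here), not to the number of documents.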

Jonathan Meyer

06/15/2021, 6:53 PM
Oh, thanks for the info, learning everyday 🙂
In case you want to have a look now, here are the query logs with trace, with ORDER BY and without ORDER BY.
"are you seeing the second query to be consistently slower?"
Yes, consistently in the 300-350ms range, while the other one is in the 7-13ms range.

Kishore G

06/15/2021, 7:15 PM
It’s a simple optimization: we use the dictionary to answer the query if there is no predicate.
But it looks like we also require that there is no ORDER BY for that optimization to kick in.
@Mayank should be a simple fix to enhance the optimizer

Mayank

06/15/2021, 7:17 PM
Yeah, I was about to check the code to see whether we do dictionary-based execution for distinct or not.
If not, then something else might be going on.

Kishore G

06/15/2021, 7:19 PM
We do
It’s just that if there is an ORDER BY, we fall back to a full scan
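Editor's sketch of why a dictionary could, in principle, answer a predicate-free DISTINCT even with ORDER BY: the distinct values of a column are exactly its dictionary entries, so merging the per-segment dictionaries and sorting them never touches a document. `SegmentDictionary` and `distinctOrdered` are hypothetical names for illustration, not Pinot's implementation, and the kpi values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class DictionaryDistinctSketch {
    // Hypothetical stand-in for a per-segment dictionary: the sorted, unique values of one column.
    static final class SegmentDictionary {
        private final String[] sortedValues;

        SegmentDictionary(String... values) {
            // De-duplicate and sort once, when the "segment" is built.
            this.sortedValues = Arrays.stream(values).distinct().sorted().toArray(String[]::new);
        }

        String[] sortedValues() {
            return sortedValues;
        }
    }

    // Merges the dictionaries of all segments into an ordered distinct list, capped at limit.
    static List<String> distinctOrdered(List<SegmentDictionary> segments, int limit) {
        TreeSet<String> merged = new TreeSet<>();
        for (SegmentDictionary dictionary : segments) {
            merged.addAll(Arrays.asList(dictionary.sortedValues()));
        }
        List<String> result = new ArrayList<>();
        for (String value : merged) {
            if (result.size() >= limit) {
                break;
            }
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        List<SegmentDictionary> segments = List.of(
            new SegmentDictionary("latency", "errors", "latency"),
            new SegmentDictionary("throughput", "errors"));
        System.out.println(distinctOrdered(segments, 10)); // prints [errors, latency, throughput]
    }
}
```

In the scenario from this thread the work would scale with roughly 45 values times 90 segments rather than with 100M documents.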

Mayank

06/15/2021, 7:20 PM
For distinct count we do; distinct was added much later, so I'm not sure
// Returns true for aggregation functions that can be answered from the dictionary alone.
public static boolean isFitForDictionaryBasedComputation(String functionName) {
    //@formatter:off
    return functionName.equalsIgnoreCase(AggregationFunctionType.MIN.name())
        || functionName.equalsIgnoreCase(AggregationFunctionType.MAX.name())
        || functionName.equalsIgnoreCase(AggregationFunctionType.MINMAXRANGE.name())
        || functionName.equalsIgnoreCase(AggregationFunctionType.DISTINCTCOUNT.name())
        || functionName.equalsIgnoreCase(AggregationFunctionType.SEGMENTPARTITIONEDDISTINCTCOUNT.name());
    //@formatter:on
}
Apparently not for distinct.
Which is what I was suspecting to begin with: there's something else going on

Kishore G

06/15/2021, 7:22 PM
Interesting.. definitely worth looking into.. let’s continue on a GitHub issue.
@Jonathan Meyer can you please file an issue

Jonathan Meyer

06/15/2021, 7:30 PM
@Kishore G @Mayank Sure, will do soon. Happy to see I brought up an interesting topic 🙂