Hello ^^ Not really an issue just checking if this is normal Apache Pinot #troubleshooting


Jonathan Meyer

06/15/2021, 6:45 PM

Hello ^^ Not really an issue, just checking if this is "normal behavior".

select DISTINCT(kpi) from kpis

takes ~6ms (with 100M docs, and numDocsScanned: 100000). This query returns only 45 strings. But doing

select DISTINCT(kpi) from kpis ORDER BY kpi

takes >300ms (50 times slower). It scans every document (numDocsScanned: 101250000). I guess the ORDER BY breaks some optimizations. But from the outside it seems like pretty surprising behavior (sorting 45 strings "should not take this long" is what I mean). Anyway, not here to complain, just wanted to point it out in case it is considered worth investigating.

Mayank

06/15/2021, 6:50 PM

Is the cardinality of kpi only 45 and both return just 45 values?


Mayank

06/15/2021, 6:50 PM

If so, there does seem to be some room for optimization. Mind filing an issue?

Jonathan Meyer

06/15/2021, 6:50 PM

Yes, only 45 different 'kpis' exist in all 100M docs

Jonathan Meyer

06/15/2021, 6:51 PM

@Mayank Okay! I'll file this one and another sometime this week 😄

Mayank

06/15/2021, 6:51 PM

How many segments do you have?

Jonathan Meyer

06/15/2021, 6:51 PM

Mayank

06/15/2021, 6:51 PM

numDocsScanned: 100000 seems to suggest early bailout
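For readers unfamiliar with the term, an "early bailout" for DISTINCT without ORDER BY can be pictured with the sketch below. It is a hypothetical illustration, not Pinot's implementation: the class `EarlyBailoutSketch`, the `ScanResult` holder and the `limit` parameter are all assumed names. The idea is that when no ordering is requested, any `limit` distinct values form a valid answer, so the scan can stop as soon as enough have been collected. With ORDER BY, the smallest values could appear anywhere in the column, so a scan-based plan has to read every document.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch (not Pinot code): DISTINCT that stops scanning once `limit` values are found. */
final class EarlyBailoutSketch {

    /** Holds the distinct values found and how many documents were read to find them. */
    static final class ScanResult {
        final Set<String> values;
        final long docsScanned;

        ScanResult(Set<String> values, long docsScanned) {
            this.values = values;
            this.docsScanned = docsScanned;
        }
    }

    /** Scans the column in order and bails out as soon as `limit` distinct values are collected. */
    static ScanResult distinctWithLimit(String[] column, int limit) {
        Set<String> seen = new HashSet<>();
        long docsScanned = 0;
        for (String value : column) {
            if (seen.size() >= limit) {
                break; // enough values: without ORDER BY, any `limit` distinct values are acceptable
            }
            docsScanned++;
            seen.add(value);
        }
        return new ScanResult(seen, docsScanned);
    }

    public static void main(String[] args) {
        // 1,000 documents cycling through 5 distinct values.
        String[] column = new String[1_000];
        for (int i = 0; i < column.length; i++) {
            column[i] = "kpi_" + (i % 5);
        }
        ScanResult result = distinctWithLimit(column, 3);
        System.out.println(result.values.size() + " values after scanning " + result.docsScanned + " docs");
    }
}
```

In this toy version the number of documents read depends only on how quickly new values show up, which is one plausible reason a scan count can be far below the table size.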


Jonathan Meyer

06/15/2021, 6:51 PM

select COUNT(DISTINCT(kpi)) from kpis

-> 45

select COUNT(DISTINCT(kpi)) from kpis ORDER BY kpi

-> 45

Jonathan Meyer

06/15/2021, 6:52 PM

Ah, got something interesting

Jonathan Meyer

06/15/2021, 6:52 PM

With the COUNT, both queries are equally fast

Mayank

06/15/2021, 6:52 PM

COUNT without a predicate is answered from metadata alone (unless it's a hybrid table, where an implicit time predicate is added to the offline/realtime queries)

Mayank

06/15/2021, 6:53 PM

are you seeing the second query to be consistently slower?

Jonathan Meyer

06/15/2021, 6:53 PM

Oh, thanks for the info, learning everyday 🙂

Jonathan Meyer

06/15/2021, 6:53 PM

In case you want to have a look now, here are query logs with trace

Jonathan Meyer

06/15/2021, 6:54 PM

With ORDER BY (attachment: "Untitled")

Jonathan Meyer

06/15/2021, 6:55 PM

Without ORDER BY (attachment: "Untitled")

Jonathan Meyer

06/15/2021, 6:57 PM

"are you seeing the second query to be consistently slower?"

Yes, consistently in the 300-350ms range, while the other one is in the 7-13ms range.

Kishore G

06/15/2021, 7:15 PM

It's a simple optimization: we use the dictionary to answer the query if there is no predicate.

Kishore G

06/15/2021, 7:15 PM

But it looks like we also check that there is no ORDER BY before that optimization kicks in

Kishore G

06/15/2021, 7:16 PM

@Mayank should be a simple fix to enhance the optimizer

Mayank

06/15/2021, 7:17 PM

Yeah, I was about to check the code to see whether we do dictionary-based execution for DISTINCT or not.

Mayank

06/15/2021, 7:17 PM

If not then something else might be going on

Kishore G

06/15/2021, 7:19 PM

We do

Kishore G

06/15/2021, 7:19 PM

It's just that if there is an ORDER BY then we fall back to a full scan
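As a rough illustration of why a dictionary could serve an ordered DISTINCT cheaply, here is a minimal sketch. It is not Pinot's code: `DictionaryDistinctSketch`, `buildSortedDictionary` and `distinctOrdered` are hypothetical names, and the sketch assumes a single segment whose dictionary stores the column's unique values in sorted order. Under that assumption, DISTINCT with an ascending ORDER BY on the same column reduces to reading the first entries of the dictionary, with no rows touched. Across several segments the per-segment results would still have to be merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch (not Pinot code): answering DISTINCT ... ORDER BY from a sorted dictionary. */
final class DictionaryDistinctSketch {

    /** Builds the sorted set of unique values once, as a segment build step might. */
    static String[] buildSortedDictionary(String[] column) {
        return new TreeSet<>(Arrays.asList(column)).toArray(new String[0]);
    }

    /** Returns the first `limit` dictionary entries: already distinct and in ascending order. */
    static List<String> distinctOrdered(String[] sortedDictionary, int limit) {
        int n = Math.min(limit, sortedDictionary.length);
        return new ArrayList<>(Arrays.asList(sortedDictionary).subList(0, n));
    }

    public static void main(String[] args) {
        String[] column = {"latency", "errors", "latency", "cpu", "errors"};
        String[] dictionary = buildSortedDictionary(column);
        // The cost here depends on the dictionary size (number of unique values), not the row count.
        System.out.println(distinctOrdered(dictionary, 10));
    }
}
```

The design point is that the work scales with the column's cardinality (45 values in this thread) rather than with the 100M documents.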

Mayank

06/15/2021, 7:20 PM

We do it for distinct count; DISTINCT was added much later, so I'm not sure

Mayank

06/15/2021, 7:20 PM


public static boolean isFitForDictionaryBasedComputation(String functionName) {
    //@formatter:off
    return functionName.equalsIgnoreCase(AggregationFunctionType.MIN.name())
        || functionName.equalsIgnoreCase(AggregationFunctionType.MAX.name())
        || functionName.equalsIgnoreCase(AggregationFunctionType.MINMAXRANGE.name())
        || functionName.equalsIgnoreCase(AggregationFunctionType.DISTINCTCOUNT.name())
        || functionName.equalsIgnoreCase(AggregationFunctionType.SEGMENTPARTITIONEDDISTINCTCOUNT.name());
    //@formatter:on
}

Mayank

06/15/2021, 7:20 PM

Apparently not for DISTINCT

Mayank

06/15/2021, 7:21 PM

Which is what I suspected to begin with: there's something else going on

Kishore G

06/15/2021, 7:22 PM

Interesting. Definitely worth looking into. Let's continue on a GitHub issue.

Kishore G

06/15/2021, 7:22 PM

@Jonathan Meyer can you please file an issue?

Jonathan Meyer

06/15/2021, 7:30 PM

@Kishore G @Mayank Sure, will do soon. Happy to see I brought up an interesting topic 🙂

Jonathan Meyer

06/15/2021, 7:36 PM

-> https://github.com/apache/incubator-pinot/issues/7060
