# general
s
Interesting... In a simple test on the kttm-nested sample data, the explain for the first one is two nested `groupBy` queries, while the second one is a `timeseries` query. I'm not knowledgeable enough about the inner workings of each query type to know what conditions would cause one to be faster than the other. One thought: perhaps `groupBy` in this scenario only needs to read the dictionary for the column, since you are selecting a constant value, while `timeseries` scans the rows? You mentioned "sometimes"; have you figured out when one is better than the other? Column cardinality? Total rows? Number of segments?
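(Side note for anyone reproducing this check: Druid lets you prepend `EXPLAIN PLAN FOR` to any SQL query to see the native query it plans to. A minimal sketch; "kttm-nested" is the sample datasource mentioned above, and `"session"` is an assumed example column:)
```sql
-- Prepend EXPLAIN PLAN FOR to any Druid SQL query to see the native
-- query type (e.g. groupBy vs. timeseries) it will be planned into.
EXPLAIN PLAN FOR
SELECT APPROX_COUNT_DISTINCT("session")
FROM "kttm-nested"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
```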
j
I think the sweet spot is when cardinality is under 5 million. I just tested on another datasource with 650 million rows and a cardinality of 20 million: the `APPROX_COUNT_DISTINCT` took 7.88 seconds while the `GROUP BY` took 65.89 seconds. I am wondering why `APPROX_COUNT_DISTINCT` doesn't just read the dictionary for the column.
Oh, for the above test, I should add that the actual test included a `CONCAT`, so it was:
```sql
SELECT COUNT(*)
FROM (
  SELECT 1
  FROM "datasource"
  WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
  GROUP BY "Field1", "Field2", "Field3"
)
```
Took 65.89 seconds
```sql
SELECT APPROX_COUNT_DISTINCT(CONCAT("Field1", "Field2", "Field3"))
FROM "datasource"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
```
Took 7.88 seconds
g
note that `APPROX_COUNT_DISTINCT_DS_HLL` is generally faster & more accurate than `APPROX_COUNT_DISTINCT`
we're looking into changing the default, although it's tricky to do a migration path that maintains compatibility for various cases, so we haven't done it yet
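(As a reference point, the second query above rewritten with the DS_HLL variant would look something like this; a sketch, assuming the druid-datasketches extension is loaded and using the field names from the earlier example:)
```sql
-- Same approximate distinct count, backed by a DataSketches HLL sketch.
-- Requires the druid-datasketches extension to be loaded.
SELECT APPROX_COUNT_DISTINCT_DS_HLL(CONCAT("Field1", "Field2", "Field3"))
FROM "datasource"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
```
`APPROX_COUNT_DISTINCT_DS_HLL` also accepts an optional `lgK` argument, which trades sketch size for accuracy.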
i suspect with that one you'll find it is more competitive with a plain `GROUP BY` at lower cardinality
intuitively though, what is going on is that for low cardinalities, the operations we need to do to update a hash table (`GROUP BY`) and the ops we need to update a sketch are similar in complexity, and both fit comfortably in a small amount of memory, so sketches don't have an inherent advantage. (in general, their inherent advantage is that they use less memory)
so the faster one will be down to which one is better optimized for that case where the number of unique things being tracked is relatively small
`GROUP BY` is pretty good at that, and `APPROX_COUNT_DISTINCT_DS_HLL` is generally better than `APPROX_COUNT_DISTINCT`
although i'm not sure which of `GROUP BY` and `APPROX_COUNT_DISTINCT_DS_HLL` will be better, or where the breakpoint, if any, is!
so i'm curious if you try that out, how it goes 🙂