# general
s
Interesting... In a simple test on the kttm-nested sample data, the explain for the first one is two nested `groupBy` queries, while the second one is a `timeseries` query. I'm not knowledgeable enough about the inner workings of each query type to know what conditions would cause one to be faster than the other. One thought: perhaps `groupBy` in this scenario only needs to read the dictionary for the column, since you are selecting a constant value, while `timeseries` scans the rows? You mentioned "sometimes"; have you figured out when one is better than the other? Column cardinality? Total rows? Number of segments?
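(Side note for anyone reproducing this check: Druid lets you prepend `EXPLAIN PLAN FOR` to any SQL query to see the native query it plans to. A minimal sketch; "kttm-nested" is the sample datasource mentioned above, and `"session"` is an assumed example column:)
```sql
-- Prepend EXPLAIN PLAN FOR to any Druid SQL query to see the native
-- query type (e.g. groupBy vs. timeseries) it will be planned into.
EXPLAIN PLAN FOR
SELECT APPROX_COUNT_DISTINCT("session")
FROM "kttm-nested"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
```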
j
I think the sweet spot is when cardinality is under 5 million. I just tested on another datasource with 650 million rows and a cardinality of 20 million: the `APPROX_COUNT_DISTINCT` took 7.88 seconds while the `GROUP BY` took 65.89 seconds. I am wondering why `APPROX_COUNT_DISTINCT` doesn't just read the dictionary for the column.
Oh, for the above test, I should add that the actual test included a `CONCAT`, so it was:
```sql
SELECT COUNT(*)
FROM (
  SELECT 1
  FROM "datasource"
  WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
  GROUP BY "Field1", "Field2", "Field3"
)
```
Took 65.89 seconds
```sql
SELECT APPROX_COUNT_DISTINCT(CONCAT("Field1", "Field2", "Field3"))
FROM "datasource"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
```
Took 7.88 seconds
g
note that `APPROX_COUNT_DISTINCT_DS_HLL` is generally faster & more accurate than `APPROX_COUNT_DISTINCT`
we're looking into changing the default, although it's tricky to do a migration path that maintains compatibility for various cases, so we haven't done it yet
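(As a reference point, the second query above rewritten with the DS_HLL variant would look something like this; a sketch, assuming the druid-datasketches extension is loaded and using the field names from the earlier example:)
```sql
-- Same approximate distinct count, backed by a DataSketches HLL sketch.
-- Requires the druid-datasketches extension to be loaded.
SELECT APPROX_COUNT_DISTINCT_DS_HLL(CONCAT("Field1", "Field2", "Field3"))
FROM "datasource"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
```
`APPROX_COUNT_DISTINCT_DS_HLL` also accepts an optional `lgK` argument, which trades sketch size for accuracy.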
i suspect with that one you'll find it is more competitive with a plain `GROUP BY` at lower cardinality
intuitively though, what is going on is that for low cardinalities, the operations we need to do to update a hash table (`GROUP BY`) and the ops we need to update a sketch are similar in complexity, and both fit comfortably in a small amount of memory, so sketches don't have an inherent advantage. (in general, their inherent advantage is that they use less memory)
so the faster one will be down to which one is better optimized for that case where the number of unique things being tracked is relatively small
`GROUP BY` is pretty good at that, and `APPROX_COUNT_DISTINCT_DS_HLL` is generally better than `APPROX_COUNT_DISTINCT`
although i'm not sure which of `GROUP BY` and `APPROX_COUNT_DISTINCT_DS_HLL` will be better, or where the breakpoint, if any, is!
so i'm curious if you try that out, how it goes 🙂