Slackbot
06/02/2023, 3:08 PMSergio Ferragut
06/02/2023, 8:16 PMgroupby
queries while the second one is a timeseries
query. I'm not knowledgable enough about the inner working of each query type to know what conditions would cause one to be faster than the other. One thought: Perhaps groupBy in this scenario only needs to read the dictionary for the column since you are selecting a constant value and timeseries scans the rows?
You mentioned "sometimes", have you figured out when one is better than the other? column cardinality? total rows? # of segments?JRob
06/02/2023, 9:15 PMJRob
06/02/2023, 9:17 PMSELECT COUNT(*)
FROM (
SELECT 1
FROM "datasource"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY "Field1", "Field2", "Field3"
)
Took 65.89 seconds
SELECT APPROX_COUNT_DISTINCT(CONCAT("Field1", "Field2", "Field3")
FROM "datasource"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
Took 7.88 seconds
Gian Merlino
06/07/2023, 11:26 AMAPPROX_COUNT_DISTINCT_DS_HLL
is generally faster & more accurate than APPROX_COUNT_DISTINCT
Gian Merlino
06/07/2023, 11:27 AMGian Merlino
06/07/2023, 11:29 AMGROUP BY
at lower cardinalityGian Merlino
06/07/2023, 11:32 AMGROUP BY
) and the ops we need to update a sketch are similar in complexity, and both fit comfortably in a small amount of memory, so sketches don't have an inherent advantage. (in general, their inherent advantage is they use less memory)Gian Merlino
06/07/2023, 11:32 AMGian Merlino
06/07/2023, 11:33 AMGROUP BY
is pretty good at that, and APPROX_COUNT_DISTINCT_DS_HLL
is generally better than APPROX_COUNT_DISTINCT
Gian Merlino
06/07/2023, 11:33 AMGROUP BY
and APPROX_COUNT_DISTINCT_DS_HLL
will be better or where a breakpoint, if any, is!Gian Merlino
06/07/2023, 11:33 AM