Hey guys, we have a usecase of doing following que...
# general
s
Hey guys, we have a usecase of doing following query a lot:
Copy code
select distinct(<columnName>) from <tableName> limit 1000;
Is there an index that might reduce the latency? Assuming the cardinality of the column will rarely cross 1000
k
Will the query have filters
If not, it’s already optimized.. we can add an distinct implementation that leverages index for low cardinality columns.. that’s a great idea.. can you file a GitHub issue
2
j
are you able to accept an approximate result (within a few %)? distinctcounthll(field, 12) for example returns up to twice as fast on some tables i’m testing with, and it’s only off by 1
k
@Kishore G I assume the existing optimization depends on having a forward index on the column, but wanted to confirm.
@Johan Adami I’m surprised HLL would be significantly faster for a low cardinality column…is there some other optimization it does besides using HyperLogLog?
j
well it was anywhere for 10-50% faster, there were ~257 distinct values in the table/column i tested. but that’s all i ran. Are there even any other possible optimizations for distinct?
k
@Johan Adami - if the column has high cardinality, then partitioning by that column results in fewer distinct values per segment, which reduces the overhead of collecting and merging results on the broker. Or at least that’s what I assumed was why my results improved when I made that change.
j
This was a pre aggregated time column with no partition.
k
I am also surprised that HLL is faster for low cardinality columns unlesss HLL is column is using startree index
@Ken Krugler existing optimization is based on dictionary so the column has to be dictionary encoded.
But I am referring to the idea of enhancing distinct operations using inverted index
That will be amazing..
s
@Kishore G Sometimes the latency is < 100ms but for certain columns the query takes upto 2 sec. Do I have to have inverted index on all such columns
And, yes the query will have some adhoc filters. Data is partitioned by user_id which can go upwards of 150 mil
k
user_id its better to create a star-tree index with HLL column for user_id.. this will be approximate but very fast..
s
Got it, so in that case
select distinct(user_id)
will also be fast
k
yes