Apache Pinot

Hey guys, we have a usecase of doing following query a lot:
```select distinct(&lt;columnName&gt;) from &lt;tableName&gt; limit 1000;```
Is there an index that might reduce the latency? Assuming the cardinality of the column will rarely cross 1000

If not, it’s already optimized.. we can add an distinct implementation that leverages index for low cardinality columns.. that’s a great idea.. can you file a GitHub issue 

are you able to accept an approximate result (within a few %)? <https://docs.pinot.apache.org/users/user-guide-query/how-to-handle-unique-counting|distinctcounthll>(field, 12) for example returns up to twice as fast on some tables i’m testing with, and it’s only off by 1

<@UDRJ7G85T> I assume the existing optimization depends on having a forward index on the column, but wanted to confirm.

<@U0264HWF9FS> I’m surprised HLL would be significantly faster for a low cardinality column…is there some other optimization it does besides using HyperLogLog?

well it was anywhere for 10-50% faster, there were ~257 distinct values in the table/column i tested. but that’s all i ran. Are there even any other possible optimizations for distinct?

<@U0264HWF9FS> - if the column has high cardinality, then partitioning by that column results in fewer distinct values per segment, which reduces the overhead of collecting and merging results on the broker. Or at least that’s what I assumed was why my results improved when I made that change.

This was a pre aggregated time column with no partition. 

I am also surprised that HLL is faster for low cardinality columns unlesss HLL is column is using startree index

<@U01ECR21TEE> existing optimization is based on dictionary so the column has to be dictionary encoded.

But I am referring to the idea of enhancing distinct operations using inverted index 

<@UDRJ7G85T> Sometimes the latency is &lt; 100ms but for certain columns the query takes upto 2 sec.  Do I have to have inverted index on all such columns

And, yes the query will have some adhoc filters. Data is partitioned by user_id which can go upwards of 150 mil

user_id its better to create a star-tree index with HLL column for user_id.. this will be approximate but very fast..

Got it, so in that case `select distinct(user_id)`  will also be fast