Hi team I have a column with very high cardinality It is def Apache Pinot #troubleshooting

Hi team. I have a column with very high cardinalit...

Prashant Pandey

04/19/2022, 10:49 AM

Hi team. I have a column with very high cardinality. It is defined as a MV column in my schema. We also have an inverted index defined on it. However, we want to disable dictionary on this column as it’s very high cardinality and the dictionary size is almost 50% of the total segment size. However, Pinot does not let me disable it and errors out with the following message: “Cannot create an Inverted index on column tags__KEYS specified in the noDictionaryColumns config” Can I disable dict. of this column and still have inverted index defined on it? Also, when should we use a forward index vs. an inverted index?

Kartik Khare

04/19/2022, 11:09 AM

We have recently added feature to address this issue in out main branch.

optimizeDictionaryForMetrics

config in https://docs.pinot.apache.org/configuration-reference/table This only works for Single-valued columns though currently. For your second question, forward index are created by default. Inverted index should be used when you want to do a constant time lookup for a column value. https://docs.pinot.apache.org/basics/indexing/inverted-index

Kartik Khare

04/19/2022, 11:11 AM

Inverted index unfortunately can't be created on no dictionary columns. So you will need to have a dictionary for it to work. @User Is there a better solution here?

Prashant Pandey

04/19/2022, 11:16 AM

So you will need to have a dictionary for it to work.

In our case, we have around 800M segments out of which 400M is the size of the dict due to the cardinality (700k unique values). Can we somehow optimise this?

Prashant Pandey

04/19/2022, 11:18 AM

Like we want the inverted index without the dict. Dict. anyway wouldn’t be that much helpful due to the high cardinality I presume? The docs themselves say that if there is very less duplicate data, the compression is not that useful.

Kartik Khare

04/19/2022, 11:24 AM

Yeah, Having such large dicts doesn't make sense. I am not sure though why is inverted index not support on Raw columns. Can I ask one thing? If it is a high cardinality columns, do you do a lot of lookups on that? Or mostly range filters?

Prashant Pandey

04/19/2022, 11:26 AM

Yes we do have a lot of filters. We have two such columns:

tag__KEYS

: Stores keys of your tags.

tag__VALUES

: Stores values of those the corresponding tags in

tag__KEYS

. So basically, they look like:

Copy code

tag__KEYS: "key1, key2, key3"
tag__VALUES: "val1, val2, val3"

Almost all of our queries have predicates on these two columns.

Mayank

04/19/2022, 1:21 PM

Are you using Map type here? Seems like you can use JSON type which can have json index on no dict column

Mayank

04/19/2022, 1:21 PM

@User ^^

Prashant Pandey

04/19/2022, 6:57 PM

@User Sorry was AFK for a long time. Yes true in fact that is exactly what is in the roadmap. But that’ll take some time. I was looking at some quick solution that could reduce the size of my segments by almost 50% 🙂.

Mayank

04/19/2022, 11:38 PM

Short of that, I don’t think think there’s a quick solution there.

Open in Slack

Previous Next