Hi team. I have a column with very high cardinalit...
# troubleshooting
p
Hi team. I have a column with very high cardinality. It is defined as a MV column in my schema. We also have an inverted index defined on it. However, we want to disable dictionary on this column as it’s very high cardinality and the dictionary size is almost 50% of the total segment size. However, Pinot does not let me disable it and errors out with the following message: “Cannot create an Inverted index on column tags__KEYS specified in the noDictionaryColumns config” Can I disable dict. of this column and still have inverted index defined on it? Also, when should we use a forward index vs. an inverted index?
k
We have recently added feature to address this issue in out main branch.
optimizeDictionaryForMetrics
config in https://docs.pinot.apache.org/configuration-reference/table This only works for Single-valued columns though currently. For your second question, forward index are created by default. Inverted index should be used when you want to do a constant time lookup for a column value. https://docs.pinot.apache.org/basics/indexing/inverted-index
Inverted index unfortunately can't be created on no dictionary columns. So you will need to have a dictionary for it to work. @User Is there a better solution here?
p
So you will need to have a dictionary for it to work.
In our case, we have around 800M segments out of which 400M is the size of the dict due to the cardinality (700k unique values). Can we somehow optimise this?
Like we want the inverted index without the dict. Dict. anyway wouldn’t be that much helpful due to the high cardinality I presume? The docs themselves say that if there is very less duplicate data, the compression is not that useful.
k
Yeah, Having such large dicts doesn't make sense. I am not sure though why is inverted index not support on Raw columns. Can I ask one thing? If it is a high cardinality columns, do you do a lot of lookups on that? Or mostly range filters?
p
Yes we do have a lot of filters. We have two such columns:
tag__KEYS
: Stores keys of your tags.
tag__VALUES
: Stores values of those the corresponding tags in
tag__KEYS
. So basically, they look like:
Copy code
tag__KEYS: "key1, key2, key3"
tag__VALUES: "val1, val2, val3"
Almost all of our queries have predicates on these two columns.
m
Are you using Map type here? Seems like you can use JSON type which can have json index on no dict column
@User ^^
p
@User Sorry was AFK for a long time. Yes true in fact that is exactly what is in the roadmap. But that’ll take some time. I was looking at some quick solution that could reduce the size of my segments by almost 50% 🙂.
m
Short of that, I don’t think think there’s a quick solution there.