Is it possible to have table sorted on a multi valued column Apache Pinot #general

Join Slack

Is it possible to have table sorted on a multi val...

# general

Ashish

11/26/2021, 4:32 PM

Is it possible to have table sorted on a multi valued column?

Sidd

11/26/2021, 4:34 PM

No, we don't support sorted index on multi value column

Ashish

11/26/2021, 4:35 PM

I see. Is it possible to add this support?

Sidd

11/26/2021, 4:38 PM

I think settling on semantics of what sorted means for multi valued column would be crucial and how the sort order will be leveraged during filter processing. I am guessing for your string column, you sorted each column array value and there is no order across column values. I suggest creating an issue to start a discussion thread.

Ashish

11/26/2021, 4:41 PM

Will create an issue. Yes, I just sorted by column array value.

Sidd

11/26/2021, 4:51 PM

I can't see how sorting a multi valued column will exactly help with performance. Dictionary and inverted index on a multi valued column is created in same manner as for a single valued column. This implies that predicate evaluation has no difference. It's the forward index that has different format for multi valued column. If let's say we sort it, then predicate evaluation across rows will still require inverted index. Anyway lets create an issue and discuss there. I don't remember if this came up before

Ashish

11/26/2021, 5:44 PM

Ok, maybe, I am stating my problem a bit differently. Here is the original issue that I was trying to figure out - for a multi valued column, the forward index is taking up a quite bit of space. I created a MV column that has cardinality of 10, dictionary takes 4 bits per element, and each row, say, contains exactly 5 elements in the array. For a segment with 100000 rows (docs), it ends up taking (4*5+2)*100000/8 bytes. It seems Pinot stores array of dictId + length in the forward index. I was expecting it to dictionary encode the array of dictId, as cardinality of MV column is not 100000.

Ashish

11/26/2021, 5:45 PM

I am ok with creating a inverse index for query speed up but I am mainly concerned with the space issue.

Ashish

11/26/2021, 5:49 PM

In the above example, each entry in the fwd index is taking up 3 bytes - while only 17 bits are need to represent 100,000 unique values and the cardinality of the MV column is much lesser than the number of docs in the segment.

Mayank

11/26/2021, 6:23 PM

If you dict encode on array, then you size of dictionary will become large. You should look at overall size of dict + fwd index for a column

Mayank

11/26/2021, 6:24 PM

Could you please share what that overall size would be in your case?

Mayank

11/26/2021, 6:26 PM

I don’t think sorted index is the answer here. But would like to understand your storage concern a bit more. What’s the overall data size and is this really exploding your data size?

Ashish

11/26/2021, 6:28 PM

Here is what I did: I created two MV columns (string, dimension) tags__KEYS, tags__VALUES and created another column seriesID is single value with content equal to k1=v1,k2=v2, … So for a row if tags__KEYS = k1,k2 and tags__VALUES=v1,v2, then seriesID=“k1=v1,k2=v2”.

Ashish

11/26/2021, 6:29 PM

The cardinality of tags__KEYS=10, cardinality of tags__VALUES=65, cardinality of seriesID=10000. Number of docs=720000.

Ashish

11/26/2021, 6:30 PM

With this data, here are sizes for the dictionary and fwd indices: index_map:tags__VALUES.forward_index.size = 7214060 index_map:tags__KEYS.forward_index.size = 4514060 index_map:seriesID.forward_index.size = 1260008 index_map:tags__KEYS.dictionary.size = 78 index_map:tags__VALUES.dictionary.size = 1048 index_map:seriesID.dictionary.size = 50008

Ashish

11/26/2021, 6:32 PM

So tags_KEYS and tags_VALUES end up taking 16 bytes per row, while the seriesID takes up total of 1.81 bytes per row.

Ashish

11/26/2021, 6:33 PM

That’s a factor 9

Ashish

11/26/2021, 6:33 PM

This is blowing up the space requirements

Ashish

11/26/2021, 6:34 PM

Here the cardinality of seriesID should be same as array of dictIds in tags_KEYS+tags+VALUES

Ashish

11/26/2021, 6:35 PM

If seriesID is treated as a dictionary id for tags_KEYS+tags_VALUES then we can also have a sorted range index on it.

Ashish

11/26/2021, 6:36 PM

When I upload a segment sorted by seriesId, even the 1.81 bytes per row usage for seriesId disappeared too!

Ashish

11/26/2021, 6:36 PM

That is what I want to achieve.

Ashish

11/26/2021, 6:42 PM

just to be clear, I am not storing the full k1=v1, k2=v2 in the seriesID - just the serial number.

Mayank

11/26/2021, 6:53 PM

Hmm, may be we should understand if the data modeling is optimal?

Mayank

11/26/2021, 8:21 PM

To close the thread: We sync'd up offline, and based on the requirements, a better data model is to use JSON index, that also seems to mitigate the storage issue.

4 Views

Open in Slack

Previous Next