# pinot-perf-tuning
k
Is there a way to have an inverted index for a column, but not store the column data? So a pure filter-only field?
m
The inv index has a dictId-to-docIds mapping. You need the dictionary to store the values. This is the current implementation
k
Not yet, but it's not hard to do this. File an issue
k
@Kishore G would it make sense to add a “noStorageColumns” config setting for tables?
k
yes, something along those lines
also, add some points on why this feature is important
is it purely about storage on disk? Because Pinot will not read the forward index if it's never accessed in a query
m
Just for my understanding, is this a request for a sparse dictionary? I am missing something: don't we need to have some storage for values to be able to reference them from queries?
Oh, so have the inv index but not the fwd index
k
@Mayank there are three things: forward index, dictionary, inverted index
m
Yeah got it
k
what @Ken Krugler is asking for is not to store the forward index
m
Yes. It would be great to see how much storage is being used for the fwd index in your case @Ken Krugler. The index_map file inside the segment dir has that info.
k
I must be missing something, given the above discussion 🙂 In Lucene, you can have a field in an index which only has the terms-to-docIdSet mappings, but without any stored data. Given what you said above, it sounds like the equivalent is to have the forward dictionary (so you have dict ids) and the inverted index (to map from a dict id to a set of doc ids), but no actual data, yes?
m
What you are referring to as actual data maps to the fwd index. That also does not hold the raw values; it is just the encoded dictIds per docId
To add more detail:
```
Dictionary: value -> id map
Fwd index:  for each docId -> dictId
Inv index:  for each dictId -> list of docIds
```
With dictionary encoding and bit encoding (10 bits can represent 2^10 unique values), you can get compression.
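To make the bit-encoding point concrete, here is a minimal sketch (plain Java, not Pinot code) of how the forward index bit width follows from the dictionary cardinality:

```java
// Minimal sketch (not Pinot code): with N unique values in the dictionary,
// each dictId in the forward index needs ceil(log2(N)) bits.
public class BitWidthSketch {
    static int bitsNeeded(int cardinality) {
        // One unique value still needs 1 bit; otherwise ceil(log2(cardinality)).
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(cardinality - 1));
    }

    public static void main(String[] args) {
        System.out.println(bitsNeeded(1024));   // 10 bits, the 2^10 example above
        System.out.println(bitsNeeded(144997)); // 18 bits, matching the metadata later in this thread
    }
}
```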
k
We have a multi-valued field, so in that case the fwd index is what?
m
So the question to you is: are you trying to reduce storage cost? If so, the only thing you can eliminate is the fwd index. Let's check its size for your segments.
k
We’re blowing the 2gb limit for a column. So yes, I guess you’d call that a “storage cost” 🙂
m
for MV: you can think of docId -> [list of dictIds]
For that, you might want to reduce num docs per segment instead.
k
When our next build succeeds (where we're increasing the number of segments) I can check the fwd index sizes
m
Do you have star tree?
k
Yes, though not with that column
m
Ok, then I am curious to know the metadata for that column (cardinality, etc)
k
We’ve found that having # of segments <= number of available server threads really helps our query performance, thus the balancing act with segment size
roughly 300K unique terms
(it’s a text field that we’re tokenizing/normalizing)
m
Do you have text index for that column?
k
No - it would be huge, and all we really need is term-level filtering
m
Also, it might not make sense to have a dictionary on that column (if you have filters on other columns)
Ok, then explore no-dict index for that column
Ok, once you have the index generated, please share the metadata.properties for that column.
That will help me understand if no-dict or some other index might be better for that column
k
OK, thanks
m
For dict, we pad strings to make them the same length, and that could lead to a lot of storage wastage.
Metadata will tell us
k
wow, yes that would be an issue
I could throw in a filter to remove long terms, which would also help
m
no-dict will eliminate padding and hence reduce size. But there is no inv index for no-dict, so you would need to rely on setting indexes on other columns
for high cardinality with uneven string sizes, no-dict gives better overall size
k
Right, but sounds like what would be the best match for our use case would be a dictionary + inv index, without the forward index.
m
I think the wastage from padding in dictionary might be the root cause, and if so, removing fwd index won’t help
Let’s look at the index sizes and metadata once we have that
k
Max term length is 20, average term length is 6, so assume ~14 bytes/term of padding waste * 400K terms = 5.6MB
And yes, agree that examining the metadata is the right next step.
m
Sounds good
k
From metadata.properties:
```
column.landingPageText_terms.cardinality = 144997
column.landingPageText_terms.totalDocs = 6100482
column.landingPageText_terms.dataType = STRING
column.landingPageText_terms.bitsPerElement = 18
column.landingPageText_terms.lengthOfEachEntry = 45
column.landingPageText_terms.columnType = DIMENSION
column.landingPageText_terms.isSorted = false
column.landingPageText_terms.hasNullValue = false
column.landingPageText_terms.hasDictionary = true
column.landingPageText_terms.textIndexType = NONE
column.landingPageText_terms.hasInvertedIndex = true
column.landingPageText_terms.isSingleValues = false
column.landingPageText_terms.maxNumberOfMultiValues = 4984
column.landingPageText_terms.totalNumberOfEntries = 312834131
column.landingPageText_terms.isAutoGenerated = false
column.landingPageText_terms.maxValue = \uFF42\uFF49\uFF5A
column.landingPageText_terms.defaultNullValue = null
```
And top four columns by size:
```
landingPageText_terms.forward_index.size	743576242
destinationUrl.forward_index.size	199861484
creativeText.forward_index.size	99124657
imageUrl.forward_index.size	95375146
```
m
What about inv index and dict size for landingPageText?
Oh it is multi valued?
k
yes
m
My guess is the inv index might be even bigger
Can you share inv index and dict size?
k
So where is inv index size?
m
Index_map file
k
All I’ve got for landingPageText_terms is:
```
landingPageText_terms.dictionary.startOffset = 406482808
landingPageText_terms.dictionary.size = 6524873
landingPageText_terms.forward_index.startOffset = 413007681
landingPageText_terms.forward_index.size = 743576242
```
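Those numbers line up with the metadata above. A rough back-of-envelope check (the small leftovers are assumed, not confirmed, to be headers plus the per-doc offset/length bookkeeping an MV forward index needs):

```java
// Back-of-envelope check against the metadata and index_map numbers above.
public class SizeCheck {
    public static void main(String[] args) {
        long cardinality = 144_997;        // column.landingPageText_terms.cardinality
        long paddedEntryBytes = 45;        // lengthOfEachEntry (padded width)
        long totalEntries = 312_834_131L;  // totalNumberOfEntries (all MV values)
        long bitsPerElement = 18;

        System.out.println(cardinality * paddedEntryBytes);    // 6,524,865 vs 6,524,873 reported
        System.out.println(totalEntries * bitsPerElement / 8); // ~703.9M vs 743,576,242 reported
    }
}
```

So the dictionary padding costs only ~6.5MB in total here; the 743MB forward index is dominated by the sheer number of MV entries, not by padding.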
m
No inv index on this column?
k
Hmm, the metadata file says
`column.landingPageText_terms.hasInvertedIndex = true`
m
Unfortunately it always says that
If the index_map file does not show it and you don't have it in the indexing config, then there is no inv index
k
and that column is in the `tableIndexConfig`’s `invertedIndexColumns` list
m
Hmm
Oh, there’s this config to generate the inv index offline vs in the server during loading
k
I didn’t build these segments, someone else at the company did, but I believe the tableIndexConfig matches
m
But if the index_map does not have that info then it is not built yet
k
ah, right
"createInvertedIndexDuringSegmentGeneration": false,
m
Typically inv index size for MV columns might be bigger than fwd index
k
Should be a dict id, and a bitset, right?
(compressed bitset, like RoaringDocIdSet)
m
Yes
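A minimal sketch of that shape using RoaringBitmap (illustrative only; Pinot's actual reader/writer classes differ):

```java
import java.util.HashMap;
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

// Illustrative inverted index: dictId -> compressed set of docIds.
public class InvIndexSketch {
    private final Map<Integer, RoaringBitmap> dictIdToDocIds = new HashMap<>();

    void add(int dictId, int docId) {
        dictIdToDocIds.computeIfAbsent(dictId, k -> new RoaringBitmap()).add(docId);
    }

    // Filter: all docIds whose column contains the value behind dictId.
    RoaringBitmap matches(int dictId) {
        return dictIdToDocIds.getOrDefault(dictId, new RoaringBitmap());
    }
}
```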
I have seen this pattern in the past, where the server OOMs when building the inv index of MV columns (2GB limit)
We get around it by reducing the num docs per segment
Is adding more cores to the server not an option?
k
Adding more servers is an option, yes. Just trying to figure out bounds on what we can do here.
But forward index for this column is 750M (out of a total of 886M, for this segment), so getting rid of that would be nice.
m
Yes, agree, if you definitely need inv index on the column. Otherwise, we need to check which of the two is smaller (fwd vs inv)
k
we need to be able to filter using terms, so yes I think the inv index is a requirement
m
Well, if there are other filters in the query which eliminate a lot of rows, maybe not
k
Is there documentation on the format of the forward index? I’m also curious how that gets compressed (using Snappy?), if at all.
m
Uses min number of bits to represent dictIds
There is no additional compression on top of that for dict columns
k
yeah, just seems like you’d need an additional table to map from docId to a bit offset into the bit-packed dictIds, and a count of how many dictIds exist for that docId. The Lucene index formats deal with similar issues, and get pretty complex trying to trade off size for lookup speed.
m
For single value we don't need an offset; for MV, yes
k
yes, I’m interested in the MV case
Which source file should I look at, if there’s no documentation?
m
FixedBitMVForwardIndexWriter
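For orientation before reading that file, a simplified sketch of the layout it writes (plain int arrays here for clarity; the real writer bit-packs both the values and the bookkeeping):

```java
// Simplified sketch of a multi-value forward index, not the actual on-disk format.
public class MvFwdIndexSketch {
    private final int[] startOffsets; // startOffsets[docId] = first position in 'values'
    private final int[] numValues;    // numValues[docId] = how many dictIds this doc has
    private final int[] values;       // all dictIds, concatenated doc by doc

    MvFwdIndexSketch(int[] startOffsets, int[] numValues, int[] values) {
        this.startOffsets = startOffsets;
        this.numValues = numValues;
        this.values = values;
    }

    // Read the dictIds for one docId.
    int[] getDictIds(int docId) {
        int[] out = new int[numValues[docId]];
        System.arraycopy(values, startOffsets[docId], out, 0, out.length);
        return out;
    }
}
```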
s
We recently did this for the text index. But didn't remove the forward index completely. The raw text data was huge and was taking up a ton of storage. So we stored a dummy value in the fwd index, dictionary encoded
This was much easier than changing the semantics completely by not having the fwd index physically
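To see why that trick helps so much, a rough estimate for the column in this thread (assumptions: one dummy dictId per doc, so dictionary cardinality 1 and 1 bit per entry):

```java
// Rough estimate of fwd index size after the dummy-value trick (assumptions above).
public class DummyFwdEstimate {
    public static void main(String[] args) {
        long totalDocs = 6_100_482L;       // from the metadata earlier in the thread
        System.out.println(totalDocs / 8); // ~762KB, vs the ~743MB stored today
    }
}
```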
k
Thanks @Sidd I guess I could look into that and see how hard it would be to do the same thing for an arbitrary column, given a table config setting.
@Sidd - where exactly in the code are you writing out a dummy fwd index for text columns?
s
This doesn't go all the way in not having the fwd index physically. I am interested in seeing how we can possibly not have the fwd index at all, and whether it is worth it or not, given that with the above change the storage overhead is already significantly reduced
👍 2
k
isn't it a matter of having an empty forward index reader impl?
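A hypothetical sketch of that idea (the interface below is invented for illustration; Pinot's actual forward index reader API differs):

```java
// Hypothetical interface, for illustration only.
interface FwdReaderSketch {
    int[] getDictIds(int docId);
}

// A filter-only column could return a dummy result for every doc, so nothing
// needs to exist on disk for the forward index; filtering still goes through
// the dictionary and inverted index.
class EmptyFwdReaderSketch implements FwdReaderSketch {
    private static final int[] EMPTY = new int[0];

    @Override
    public int[] getDictIds(int docId) {
        return EMPTY;
    }
}
```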