# troubleshooting
l
Hello everyone. Is there any way to specify that we do not want any indexes for a field? We are struggling with a very large text blob, which seems to be stored in the indexes folder on the servers. We want the data to only reside on our deep store, and not be stored on disk at all. I’ve tried adding the field to the noDictionaryColumns and setting the fieldConfig encodingType to RAW, but it still seems to be creating a forward index which is stored on disk. Any ideas?
Here’s output from ls -laH
drwxr-xr-x 2 root root      4096 Mar  7 15:57 .
drwxr-xr-x 3 root root      4096 Mar  7 15:57 ..
-rw-r--r-- 1 root root 276774030 Mar  7 15:57 columns.psf
-rw-r--r-- 1 root root        16 Mar  7 15:57 creation.meta
-rw-r--r-- 1 root root      2617 Mar  7 15:57 index_map
-rw-r--r-- 1 root root     16669 Mar  7 15:57 metadata.properties
And looking at the index_map
large_text_field.forward_index.startOffset = 17683196
large_text_field.forward_index.size = 255005500
From metadata.properties
column.large_text_field.cardinality = -2147483648
column.large_text_field.totalDocs = 3629495
column.large_text_field.dataType = STRING
column.large_text_field.bitsPerElement = 31
column.large_text_field.lengthOfEachEntry = 0
column.large_text_field.columnType = DIMENSION
column.large_text_field.isSorted = false
column.large_text_field.hasNullValue = false
column.large_text_field.hasDictionary = false
column.large_text_field.textIndexType = NONE
column.large_text_field.hasInvertedIndex = true
column.large_text_field.hasFSTIndex = false
column.large_text_field.hasJsonIndex = false
column.large_text_field.isSingleValues = true
column.large_text_field.maxNumberOfMultiValues = 0
column.large_text_field.totalNumberOfEntries = 3629495
column.large_text_field.isAutoGenerated = false
column.large_text_field.defaultNullValue = null
Any help greatly appreciated
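For scale, here is a quick back-of-the-envelope check using only the numbers pasted above (the sizes from ls and index_map, and totalDocs from metadata.properties). It is a rough sketch: it shows how much of columns.psf the forward index for this one column accounts for, and the average on-disk size per document.

```python
# Figures copied from the ls output, index_map and metadata.properties above.
forward_index_bytes = 255_005_500   # large_text_field.forward_index.size
columns_psf_bytes = 276_774_030     # size of columns.psf
total_docs = 3_629_495              # column.large_text_field.totalDocs

# Fraction of the segment's column file taken by this one forward index.
share = forward_index_bytes / columns_psf_bytes

# Average stored (already compressed) bytes per document for this column.
avg_bytes_per_doc = forward_index_bytes / total_docs

print(f"forward index share of columns.psf: {share:.1%}")
print(f"average on-disk bytes per document: {avg_bytes_per_doc:.1f}")
```

In this segment the forward index for large_text_field is roughly 92% of columns.psf.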
r
the forward index is just the storage for the column
l
Why is it storing it on disk?
r
it's called a forward index because it implicitly maps "forward" from the docId to the value (an inverted index goes the other way, from the value to the docIds)
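To make the direction of the mapping concrete, here is a toy sketch in plain Python. It is not Pinot's implementation; the values and the helper name build_inverted_index are made up for illustration. The forward index is just the column's values laid out by docId, and an inverted index is derived from it.

```python
from collections import defaultdict

# Hypothetical column values; the list position plays the role of the docId.
forward_index = ["red", "blue", "red", "green"]


def build_inverted_index(values):
    """Map each distinct value to the list of docIds that contain it."""
    inverted = defaultdict(list)
    for doc_id, value in enumerate(values):
        inverted[value].append(doc_id)
    return dict(inverted)


inverted_index = build_inverted_index(forward_index)

print(forward_index[2])        # forward lookup: docId -> value
print(inverted_index["red"])   # inverted lookup: value -> docIds
```

Because the forward index is the only place the values themselves live, dropping it means the column's data is gone from the segment, which is why it still shows up on disk without a dictionary.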
l
So there is no way to further reduce the disk space usage for that dimension field, except for enabling compression?
r
it should be compressed by default
l
Yeah that’s right, snappy compressed
The problem is it’s a very large base64 string
So my disks are exploding
r
we changed the default to LZ4 (LZ4_WITH_LENGTH is better) because it decompresses faster and has a better ratio
l
I’m trying to figure out if there is any way for me to prevent it from being stored on disk at all, and only retrieved from deep store
r
yes, let me look into it
l
Thank you 🙂
r
@User is the expert on this
l
👍 Thank you. I’ll wait for more info on this. In the meantime, I’m looking to upgrade to 0.10
Perhaps I need to look into making this specific field external to pinot
These text blobs are ~2-5 MB in size
So they really bloat my segments
r
there are some primitives in tiered storage for separating indexes from data, but I'm not sure where the boundaries are
l
Tiered storage is not OSS right?
r
by the way, I've found in the past that base64 encoding can confuse compression algorithms like LZ4 and Snappy, because it scrambles data across the byte boundaries those algorithms exploit
you can probably get a big improvement by not base64 encoding and changing it to BYTES
then apply LZ4_WITH_LENGTH and use V4 raw index
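A small experiment illustrates the base64 point. This is a sketch under loud assumptions: zlib from the Python standard library stands in for LZ4/Snappy (neither ships with Python), and the payload is synthetic word data rather than the real blobs, so the exact numbers will differ for actual data.

```python
import base64
import random
import zlib

# Synthetic payload: random words from a small, made-up vocabulary, so the
# raw bytes contain plenty of byte-aligned repetition.
rng = random.Random(0)
vocabulary = [b"segment", b"index", b"forward", b"deep",
              b"store", b"column", b"raw", b"blob"]
raw = b" ".join(rng.choice(vocabulary) for _ in range(20_000))

# base64 packs every 3 input bytes into 4 output characters, so the same word
# is encoded differently depending on its offset modulo 3.
encoded = base64.b64encode(raw)

# zlib is only a stand-in here for the codecs discussed in the thread.
raw_compressed = len(zlib.compress(raw))
encoded_compressed = len(zlib.compress(encoded))

print(f"raw bytes:          {len(raw)} -> {raw_compressed} compressed")
print(f"base64-encoded:     {len(encoded)} -> {encoded_compressed} compressed")
```

Even before compression, base64 inflates the data by about a third, so storing the decoded bytes in a BYTES column avoids that overhead regardless of the codec.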
regarding tiered storage: no, it's not OSS. We have a proprietary implementation, but the primitives to support it are open source
l
That makes sense. I could attempt doing that too