
Elon

08/05/2020, 12:30 AM
We are using Pinot to store log data and noticed that string columns are truncated at 512 characters. Is there another data type or setting we should use to increase the length?

Neha Pawar

08/05/2020, 12:36 AM

Elon

08/05/2020, 12:37 AM
Thanks!
I guess the default is 512 for string?
Ah, I see it in FieldSpec

Neha Pawar

08/05/2020, 12:38 AM
yes
private static final int DEFAULT_MAX_LENGTH = 512;
// NOTE: for STRING column, this is the max number of characters; for BYTES column, this is the max number of bytes
private int _maxLength = DEFAULT_MAX_LENGTH;

Elon

08/05/2020, 12:39 AM
Would there be any issues with specifying a really large value? We are going to try out using the text index - using it to store log records

Neha Pawar

08/05/2020, 12:46 AM
not sure. @Mayank @Jackie ?

Elon

08/05/2020, 12:46 AM
We're trying 2mb out, we can let you know also 🙂
But if you have any serious reservations let me know
And thanks for your help again!

Sidd

08/05/2020, 12:54 AM
@Elon, are you going to use text_match?

Elon

08/05/2020, 12:54 AM
Yep:)

Sidd

08/05/2020, 12:55 AM
There is a writer config that allows easy handling for large records (> 1MB per record)
I recommend enabling that

Elon

08/05/2020, 12:55 AM
Oh nice! What's that?

Sidd

08/05/2020, 12:56 AM
What's the largest length (in terms of number of characters) for a cell in this column?

Elon

08/05/2020, 12:56 AM
We're trying 250k characters
i.e. ~2mb

Sidd

08/05/2020, 12:56 AM
are they all standard ascii?

Elon

08/05/2020, 12:57 AM
utf8
is that ok?

Sidd

08/05/2020, 12:57 AM
Then fine. I asked because although UTF-8 takes 1 byte per character for standard ASCII, it takes up to 4 bytes per character for the higher ranges
So in that case 250k characters kind of translates to 5MB
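A minimal sketch of the sizing point above, assuming only the standard Java library: it measures the UTF-8 encoded size of a few sample strings and derives a worst-case bound for a given character count. The class and method names (`Utf8SizeSketch`, `utf8Bytes`, `worstCaseUtf8Bytes`) are hypothetical and not part of Pinot. Under the 1-to-4-bytes-per-character rule, 250k characters come to between roughly 250 KB and 1 MB encoded, so the sizes quoted in the thread read as loose estimates.

```java
import java.nio.charset.StandardCharsets;

public class Utf8SizeSketch {
    // Encoded size in bytes of a string when written as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Upper bound on the UTF-8 size of a value with the given number of
    // Unicode code points: each code point encodes to at most 4 bytes.
    static long worstCaseUtf8Bytes(long codePoints) {
        return codePoints * 4L;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("a"));            // ASCII letter
        System.out.println(utf8Bytes("\u00e9"));       // Latin letter with accent
        System.out.println(utf8Bytes("\u4e2d"));       // CJK character
        System.out.println(utf8Bytes("\uD83D\uDE42")); // emoji (one code point, two Java chars)
        System.out.println(worstCaseUtf8Bytes(250_000L));
    }
}
```

Measuring a sample of real log records this way gives a tighter estimate than the worst-case bound, since most log text is plain ASCII.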

Elon

08/05/2020, 12:58 AM
Makes sense
What's the magic config? 🙂

Sidd

08/05/2020, 12:58 AM
please enable the text index as follows
"fieldConfigList": [
  {
    "encodingType": "RAW",
    "indexType": "TEXT",
    "name": "yourTextColumn",
    "properties": {
      "deriveNumDocsPerChunkForRawIndex": "true"
    }
  }
],
"tableIndexConfig": {
  "noDictionaryColumns": [
    "yourTextColumn"
  ]
}
Set this in the table config
Since the raw data is huge in size, I would not recommend creating a dictionary on that column
since it is essentially a log and you will be doing arbitrary text search as opposed to fixed/exact matches

Elon

08/05/2020, 12:59 AM
Thanks, this is great!

Sidd

08/05/2020, 1:00 AM
The docs already talk about how to enable the text index, but since I added the writer to handle blob-like text later, that is probably not in the docs. I will add it
Please let me know if you have any questions

Jackie

08/05/2020, 1:00 AM
Is rawIndexWriterVersion mandatory?
@Sidd Should we fix the version in the table config?

Sidd

08/05/2020, 1:05 AM
So using V3 rawIndexWriterVersion may not be necessary if the total size of raw data for the text column isn't > 2GB

Elon

08/05/2020, 1:05 AM
Does it hurt to include that though?

Kishore G

08/05/2020, 1:06 AM
This is an amazing thread, please add this to docs
👍 1

Elon

08/05/2020, 1:07 AM
You are all so helpful, this is incredible, everyone wants to use Pinot here... for everything :)
🍷 2

Sidd

08/05/2020, 1:08 AM
So here is the thing: it doesn't hurt, since it has been tested and that's what we are using at Li. But in case there is some other bug (for some reason) and you ever decide to downgrade/roll back the Pinot version, we will have a problem, since the V3 version uses 8-byte offsets and the prior version uses 4-byte offsets. The writers are backward compatible but they aren't (and can't be) forward compatible. So that's why there is a rollback problem
@Elon, how many rows (roughly) per segment?
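A rough sketch of why offset width matters for the 2GB figure mentioned above. It assumes the 4-byte offsets are stored as signed 32-bit integers, which is an assumption made here to explain the limit and is not confirmed in the thread. The class and method names (`OffsetWidthSketch`, `fitsIn4ByteOffsets`) are hypothetical and not Pinot APIs.

```java
public class OffsetWidthSketch {
    // Largest byte position a signed 4-byte (32-bit) offset can address.
    // Assumption: the pre-V3 writer's 4-byte offsets are signed ints.
    static final long MAX_4_BYTE_OFFSET = Integer.MAX_VALUE; // 2,147,483,647, just under 2 GiB

    // True if every byte of a column's raw data can be addressed with 4-byte offsets.
    static boolean fitsIn4ByteOffsets(long totalRawBytes) {
        return totalRawBytes >= 0 && totalRawBytes <= MAX_4_BYTE_OFFSET;
    }

    public static void main(String[] args) {
        long rows = 100_000L;

        // Typical case from the thread: about 2 KB per row.
        long typical = rows * 2L * 1024L;
        System.out.println(typical + " bytes fits: " + fitsIn4ByteOffsets(typical));

        // Hypothetical extreme: every row holds 250k four-byte characters.
        long extreme = rows * 250_000L * 4L;
        System.out.println(extreme + " bytes fits: " + fitsIn4ByteOffsets(extreme));
    }
}
```

With mostly small rows the per-segment total stays far below the limit, which matches the suggestion that the V3 writer may not be needed unless the column's raw data exceeds 2GB.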

Jackie

08/05/2020, 1:12 AM
Are we planning to move to the new version in the new release?

Elon

08/05/2020, 1:15 AM
We set the segment size at 200mb, and most of the log records are tiny, like 500 chars, 2kb per row, so ~100k rows per segment. Is that a good number?
I don't think we will ever roll back to a previous version, we will keep moving forward 🙂
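The rows-per-segment estimate above is simple division, sketched here with the figures from the thread (200 MB target segment size, about 2 KB per row). The names (`RowsPerSegmentSketch`, `estimateRows`) are hypothetical, and this is only the back-of-envelope arithmetic, not how Pinot itself decides when to flush a segment.

```java
public class RowsPerSegmentSketch {
    // Approximate number of rows that fit in a segment of the given size,
    // assuming every row takes about avgRowBytes.
    static long estimateRows(long segmentBytes, long avgRowBytes) {
        if (avgRowBytes <= 0) {
            throw new IllegalArgumentException("avgRowBytes must be positive");
        }
        return segmentBytes / avgRowBytes;
    }

    public static void main(String[] args) {
        long segmentBytes = 200L * 1024L * 1024L; // 200 MB target segment size
        long avgRowBytes = 2L * 1024L;            // about 2 KB per log row
        System.out.println(estimateRows(segmentBytes, avgRowBytes)); // 102400
    }
}
```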

Sidd

08/05/2020, 1:17 AM
Good to know. I just wanted to know if the total size is getting huge and you need the V3 writer. Use this config for now, push the data and let's test

Elon

08/05/2020, 4:15 PM
Text index is working great. Before the text index was created each segment was 365mb, now it's 9-11mb and I looked at the metadata.properties files in each directory, each segment is about 100k rows
Is there a compaction job that runs?

Sidd

08/05/2020, 4:16 PM
There should be an increase in size
in the segment root directory, what does du -sh . indicate?
cd segment/v3, du -sh .

Elon

08/05/2020, 4:28 PM
~11mb per segment
Just to be sure I can create another table without the text index to compare side by side
And why do the segments seem to be capped at 100k rows? At least metadata.properties reports the highest row count at exactly 100k for all segments except the consuming one

Sidd

08/05/2020, 4:30 PM
Yeah, you may want to look at the size of a segment with and without the text index.
For the offline segments, the number of rows per segment depends on the user's data prep job. So the user controls how segments are created, and at what size (number of rows), in the offline flow
👍 1

Elon

08/05/2020, 4:36 PM
For this table we only have realtime

Neha Pawar

08/05/2020, 4:59 PM
Regarding the 100k rows cap, you likely have the segment-size-based threshold enabled for creating segments in realtime. This happens if the property "realtime.segment.flush.threshold.size" is 0. Pinot will slowly ramp up the number of rows in the segment (with an initial size of 100k) until it gets to the desired size

Elon

08/05/2020, 5:01 PM
Thanks! That looks like exactly what's happening :)
We have this as the config for thresholds:
"realtime.segment.flush.threshold.time": "6h",
"realtime.segment.flush.threshold.size": "0",
"realtime.segment.flush.desired.size": "200M",
👌 1