# general
b
Hello Pinot friends, can someone help me understand the use cases for the noDictionaryColumns config? Would it make sense to specify high-cardinality columns such as event_id, or large string columns, as noDictionary?
m
Yes, that’s exactly the use case for noDictionary (high cardinality)
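A minimal sketch of where that setting lives in the table config (using event_id from the question above; this is not a complete tableIndexConfig):
```json
{
  "tableIndexConfig": {
    "noDictionaryColumns": [
      "event_id"
    ]
  }
}
```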
b
What are the downsides of it? I assume you wouldn't want to group by noDictionary fields?
m
You can certainly group-by. It doesn't have an inverted index, so ensure that there are other predicates in the query that filter out docs. (For text blobs you would probably build a text index.)
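For reference, a text index is configured per column via fieldConfigList, with the column stored raw (i.e., noDictionary); a sketch assuming a hypothetical text-blob column named description:
```json
{
  "fieldConfigList": [
    {
      "name": "description",
      "encodingType": "RAW",
      "indexType": "TEXT"
    }
  ]
}
```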
b
Ah ok, so you gain storage savings in the segment by losing the inverted index?
m
I wouldn't put it that way; the inverted index is optional anyway.
You gain storage by eliminating the dictionary, and with it the padding the dictionary adds.
You also eliminate the dictionary dereference on reads and potentially reduce disk seeks.
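As a rough worked example (numbers assumed for illustration): a dictionary-encoded string column stores each distinct value once, padded to the byte length of the longest entry, plus a small per-row dictionary id in the forward index. With 1M distinct values whose lengths range from 100 to 2,000 bytes, the dictionary alone is about 1M × 2,000 bytes ≈ 2 GB regardless of the average length, while raw (noDictionary) storage pays only each value's actual bytes, and reads skip the dictId-to-dictionary lookup.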
b
got it, thanks @Mayank!
👍 2
k
If you are querying by that id, I would add a bloom filter index on that column.
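For the record, a sketch of that index config (bloom filters help prune segments for exact-match filters; event_id is the example column from the question):
```json
{
  "tableIndexConfig": {
    "bloomFilterColumns": [
      "event_id"
    ]
  }
}
```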
👍 1
s
@Mayank if I have a large text field called "role", I was expecting that setting it to noDictionary would have the effect of smaller segments. Maybe I'm missing something though; here is my tableIndexConfig:
"tableIndexConfig": {
    "noDictionaryColumns": [
      "role"
    ],
    "nullHandlingEnabled": false,
    "loadMode": "MMAP",
    "sortedColumn": ["tenant_name"],
    "enableDefaultStarTree": false,
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "immutable-enriched-events",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "$BOOTSTRAP_SERVERS",
      "realtime.segment.flush.threshold.rows": "$REALTIME_SEGMENT_FLUSH_THRESHOLD_ROWS",
      "realtime.segment.flush.autotune.initialRows": "$REALTIME_SEGMENT_FLUSH_AUTOTUNE_INITIAL_ROWS",
      "realtime.segment.flush.threshold.time": "6h",
      "realtime.segment.flush.threshold.segment.size": "675M",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
      "stream.kafka.decoder.prop.schema.registry.rest.url": "$SCHEMA_REGISTRY_URL",
      "schema.registry.url": "$SCHEMA_REGISTRY_URL"
    }
  },
So since I don't have any indexes configured, maybe noDictionary doesn't have an impact on anything but actual indexes?
m
Is role a string column with variable length and high cardinality? If so, the saving would come from eliminating the padding that the dictionary adds to each value to make them all equal in length to the longest entry.
s
Yes, it's my test column with high cardinality, and the length is random between 1k and 2k chars. I didn't see much size difference in my initial test, though it was late today and maybe I missed something.
m
You can go into the segment dir and look at the index_map file; it lists the dict/index sizes for each column.
Note that there is a default 512-character length restriction in the schema (which can be overridden).
s
Yep found that and set maxLength to 2000
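For anyone finding this later, that override lives in the schema's field spec; a sketch assuming role is a dimension column:
```json
{
  "dimensionFieldSpecs": [
    {
      "name": "role",
      "dataType": "STRING",
      "maxLength": 2000
    }
  ]
}
```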
👍 1
OK, I got this all sorted out and I tested w/ 500k rows using a random string b/w 1 and 2000 characters. Going from Dictionary to NoDictionary represented a 40% reduction in storage.
However, adding a big high-cardinality string increased our storage by 288%. So for our use case (we were exploring a big high-cardinality field for drilldown use cases, to keep us from having to hop to the source) we will most likely not add that new column and instead do the hop to the source.
m
The 288% increase is compared with the case where the column does not exist in the table? If so, yeah.
s
Correct.
Basically, adding a big high-cardinality string (random 1 to 2,000 chars) with a dictionary increased our storage by 550%; making that column noDictionary, it increased it by 288%.