# general
b
Hello Pinot friends, can someone help me understand the use cases for the noDictionaryColumns config? Would it make sense to specify high-cardinality columns such as event_id, or large string columns, as noDictionary?
m
Yes, that’s exactly the use case for noDictionary (high cardinality)
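A minimal sketch of where that setting lives in the table config (using event_id from the question above; this is not a complete tableIndexConfig):
```json
{
  "tableIndexConfig": {
    "noDictionaryColumns": [
      "event_id"
    ]
  }
}
```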
b
What are the downsides of it? I assume you wouldn't want to group by noDictionary fields?
m
You can certainly group-by. It doesn't have an inverted index, so ensure that there are other predicates in the query that filter out docs. (For text blobs you would probably build a text index.)
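For reference, a text index is configured per column via fieldConfigList, with the column stored raw (i.e., noDictionary); a sketch assuming a hypothetical text-blob column named description:
```json
{
  "fieldConfigList": [
    {
      "name": "description",
      "encodingType": "RAW",
      "indexType": "TEXT"
    }
  ]
}
```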
b
Ah ok, so you gain storage savings in the segment by losing the inverted index?
m
I wouldn't put it that way; the inverted index is optional anyway.
You gain storage by eliminating the dictionary, and with it the padding the dictionary adds.
You also eliminate the dictionary dereference on reads and potentially reduce disk seeks.
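As a rough worked example (numbers assumed for illustration): a dictionary-encoded string column stores each distinct value once, padded to the byte length of the longest entry, plus a small per-row dictionary id in the forward index. With 1M distinct values whose lengths range from 100 to 2,000 bytes, the dictionary alone is about 1M × 2,000 bytes ≈ 2 GB regardless of the average length, while raw (noDictionary) storage pays only each value's actual bytes, and reads skip the dictId-to-dictionary lookup.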
b
got it, thanks @Mayank!
👍 2
k
If you are querying by that id, I would add a bloom filter index on that column.
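For the record, a sketch of that index config (bloom filters help prune segments for exact-match filters; event_id is the example column from the question):
```json
{
  "tableIndexConfig": {
    "bloomFilterColumns": [
      "event_id"
    ]
  }
}
```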
👍 1
s
@Mayank if I have a large text field called "role", I was expecting that setting it to noDictionary would have the effect of smaller segments. Maybe I'm missing something though; here is my tableIndexConfig:
"tableIndexConfig": {
    "noDictionaryColumns": [
      "role"
    ],
    "nullHandlingEnabled": false,
    "loadMode": "MMAP",
    "sortedColumn": ["tenant_name"],
    "enableDefaultStarTree": false,
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "immutable-enriched-events",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "$BOOTSTRAP_SERVERS",
      "realtime.segment.flush.threshold.rows": "$REALTIME_SEGMENT_FLUSH_THRESHOLD_ROWS",
      "realtime.segment.flush.autotune.initialRows": "$REALTIME_SEGMENT_FLUSH_AUTOTUNE_INITIAL_ROWS",
      "realtime.segment.flush.threshold.time": "6h",
      "realtime.segment.flush.threshold.segment.size": "675M",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
      "stream.kafka.decoder.prop.schema.registry.rest.url": "$SCHEMA_REGISTRY_URL",
      "schema.registry.url": "$SCHEMA_REGISTRY_URL"
    }
  },
So since I don't have any indexes configured, maybe noDictionary doesn't have an impact on anything but actual indexes?
m
Is role a string column with variable length and high cardinality? If so, the saving would come from eliminating the padding that the dictionary adds to each value to make them all equal in length to the longest entry.
s
Yes, it's my test column with high cardinality, and the length is random between 1k and 2k chars. I didn't see much size difference in my initial test, though it was late today and maybe I missed something.
m
You can go into the segment dir and look at the index_map file; it lists the dict/index sizes for each column.
Note that there is a default 512-character length restriction in the schema (which can be overridden).
s
Yep found that and set maxLength to 2000
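For anyone finding this later, that override lives in the schema's field spec; a sketch assuming role is a dimension column:
```json
{
  "dimensionFieldSpecs": [
    {
      "name": "role",
      "dataType": "STRING",
      "maxLength": 2000
    }
  ]
}
```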
👍 1
OK, I got this all sorted out and I tested w/ 500k rows using a random string b/w 1 and 2000 characters. Going from Dictionary to NoDictionary represented a 40% reduction in storage.
However, adding a big high-cardinality string increased our storage by 288%. So for our use case (we were exploring a big high-cardinality field for drilldown use cases, to keep us from having to hop to the source) we will most likely not add that new column and instead do the hop to the source.
m
The 288% increase is compared with the case where the column does not exist in the table? If so, yeah.
s
Correct.
Basically, adding a big high-cardinality string (random 1 to 2,000 chars) with a dictionary increased our storage by 550%; making that column noDictionary, it increased it by 288%.