# general
t
Hello, what would be the impact on storage footprint if we set maxLength of a string column (SV) to 1 MB?
Wanted to understand if Pinot always allots 1 MB to that column per row, or if the size of each row varies depending on the actual length of the column value.
We recently onboarded a table with multiple columns having maxLength of 1 MB, and we saw our servers crash due to disks running out of space: each segment of the table was ~80 GB (throughput of 1-2k events/sec, consumed for a few hours). Wanted to know if we are following best practices.
Also, if we have a column which is basically an array of strings, each of length 0-200, should we define it as a single string column with a high enough maxLength, or as an MV string column? What would be the difference?
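For reference, a minimal sketch of how the two options might be declared (field names here are illustrative; an MV column sets singleValueField to false):
```json
{
  "dimensionFieldSpecs": [
    {
      "name": "tags_sv",
      "dataType": "STRING",
      "maxLength": 40000
    },
    {
      "name": "tags_mv",
      "dataType": "STRING",
      "singleValueField": false,
      "maxLength": 200
    }
  ]
}
```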
r
dictionarized strings will be padded to that length
so not a good idea if you use dictionaries
but if you really have 1MB strings, there's a good chance they aren't repeated, so having a dictionary would be wasteful
if that's the case, you can add those columns to the "noDictionaryColumns" in "tableIndexConfig"
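e.g. a minimal sketch (illustrative column names):
```json
"tableIndexConfig": {
  "noDictionaryColumns": ["column1", "column2"],
  "loadMode": "MMAP"
}
```
Going by the padding above, a dictionary for a 1 MB-maxLength column with even 1,000 distinct values would be roughly 1,000 × 1 MB ≈ 1 GB per segment, for that column alone.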
t
> so having a dictionary would be wasteful
Yes, dictionaries are not needed; these columns won't be used in filters or groupBy in queries.
One doubt. To test this, we created two schemas (and tables). Schema 1:
```json
{
  "schemaName": "ztest_schema_max",
  "dimensionFieldSpecs": [
    {
      "name": "column1",
      "dataType": "STRING",
      "maxLength": 1000000
    },
    {
      "name": "column2",
      "dataType": "STRING",
      "maxLength": 1000000
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "producer_timestamp",
      "dataType": "LONG",
      "format": "1:SECONDS:EPOCH",
      "granularity": "1:SECONDS"
    }
  ]
}
```
Schema 2:
```json
{
  "schemaName": "ztest_schema",
  "dimensionFieldSpecs": [
    {
      "name": "column1",
      "dataType": "STRING"
    },
    {
      "name": "column2",
      "dataType": "STRING"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "producer_timestamp",
      "dataType": "LONG",
      "format": "1:SECONDS:EPOCH",
      "granularity": "1:SECONDS"
    }
  ]
}
```
We inserted the same data into both tables. On the Pinot UI, the size is reported as 730 KB for both.
The tableIndexConfig for both tables:
```json
"tableIndexConfig": {
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "rangeIndexVersion": 1,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.consumer.type": "LowLevel",
        "stream.kafka.topic.name": "events.router.v2.live",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.broker.list": "kafka-kafka-bootstrap.kafka.svc.cluster.local:9092",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "realtime.segment.flush.threshold.rows": "1",
        "realtime.segment.flush.threshold.time": "1d",
        "realtime.segment.flush.threshold.segment.size": "1m"
      },
      "loadMode": "MMAP"
    }
```
Total number of rows ingested = 206; each row is a duplicate of this:
| column1 | column2 | producer_timestamp |
| --- | --- | --- |
| router.optimizer_visibility_event | router | 1648643989 |
> dictionarized strings will be padded to that length
Going by this, I would expect the table with 1 MB columns to be significantly larger than the other, but that doesn't seem to be the case here. Can you please help me understand why?
Not sure if this matters, but each segment has only 1 row (realtime.segment.flush.threshold.rows is set to 1, as we were testing).