# general
t
Hello, what would be the impact on storage footprint if we set maxLength of a string column (SV) to 1 MB?
Wanted to understand if Pinot always allots 1 MB to that column per row, or if the size of each row varies depending on the actual length of the column value.
We recently onboarded a table with multiple columns having maxLength of 1 MB, and we saw our servers crash due to disks running out of space: each segment of the table was ~80 GB (throughput of 1-2k events/sec, consumed for a few hours). Wanted to know if we are following best practices.
Also, if we have a column which is basically an array of strings, each of length 0-200, should we define it as a single string column with a high enough maxLength, or as an MV string column? What would be the difference?
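For reference, a minimal sketch of how the two options might be declared (field names here are illustrative; an MV column sets singleValueField to false):
```json
{
  "dimensionFieldSpecs": [
    {
      "name": "tags_sv",
      "dataType": "STRING",
      "maxLength": 40000
    },
    {
      "name": "tags_mv",
      "dataType": "STRING",
      "singleValueField": false,
      "maxLength": 200
    }
  ]
}
```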
r
dictionarized strings will be padded to that length
so not a good idea if you use dictionaries
but if you really have 1MB strings, there's a good chance they aren't repeated, so having a dictionary would be wasteful
if that's the case, you can add those columns to the "noDictionaryColumns" in "tableIndexConfig"
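e.g. a minimal sketch (illustrative column names):
```json
"tableIndexConfig": {
  "noDictionaryColumns": ["column1", "column2"],
  "loadMode": "MMAP"
}
```
Going by the padding above, a dictionary for a 1 MB-maxLength column with even 1,000 distinct values would be roughly 1,000 × 1 MB ≈ 1 GB per segment, for that column alone.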
t
> so having a dictionary would be wasteful
Yes, dictionaries are not needed; these columns won't be used in filters or groupBy in queries.
One doubt. To test this, we created two schemas (and tables). Schema 1:
```json
{
  "schemaName": "ztest_schema_max",
  "dimensionFieldSpecs": [
    {
      "name": "column1",
      "dataType": "STRING",
      "maxLength": 1000000
    },
    {
      "name": "column2",
      "dataType": "STRING",
      "maxLength": 1000000
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "producer_timestamp",
      "dataType": "LONG",
      "format": "1:SECONDS:EPOCH",
      "granularity": "1:SECONDS"
    }
  ]
}
```
Schema 2:
```json
{
  "schemaName": "ztest_schema",
  "dimensionFieldSpecs": [
    {
      "name": "column1",
      "dataType": "STRING"
    },
    {
      "name": "column2",
      "dataType": "STRING"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "producer_timestamp",
      "dataType": "LONG",
      "format": "1:SECONDS:EPOCH",
      "granularity": "1:SECONDS"
    }
  ]
}
```
We inserted the same data into both tables. On the Pinot UI, the size is reported as 730 KB for both.
The tableIndexConfig for both tables:
```json
"tableIndexConfig": {
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false,
      "rangeIndexVersion": 1,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.consumer.type": "LowLevel",
        "stream.kafka.topic.name": "events.router.v2.live",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.broker.list": "kafka-kafka-bootstrap.kafka.svc.cluster.local:9092",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "realtime.segment.flush.threshold.rows": "1",
        "realtime.segment.flush.threshold.time": "1d",
        "realtime.segment.flush.threshold.segment.size": "1m"
      },
      "loadMode": "MMAP"
    }
```
Total number of rows ingested = 206; each row is a duplicate of this:
| column1 | column2 | producer_timestamp |
| --- | --- | --- |
| router.optimizer_visibility_event | router | 1648643989 |
> dictionarized strings will be padded to that length
Going by this, I would expect the table with 1 MB columns to be significantly larger than the other, but that doesn't seem to be the case here. Can you please help me understand why?
Not sure if this matters, but each segment has only 1 row (realtime.segment.flush.threshold.rows is set to 1, as we were testing).