Hi Team, I am trying to create hour based segment...
# general
v
Hi Team, I am trying to create hour-based segments in Pinot, but it's creating more than one folder under segments for the same hour. I guess this is due to some default row/data size. Can I modify these default configurations, and if so, how? What is the preferable size of a data segment in Pinot? What is the philosophy here: many files with a small size, or fewer files with a decent size? Any reference on the above would help.
schema:
{
  "schemaName": "svd",
  "dimensionFieldSpecs": [
    {
      "name": "serviceId",
      "dataType": "STRING"
    },
    {
      "name": "currentCity",
      "dataType": "STRING"
    },
    {
      "name": "currentCluster",
      "dataType": "STRING"
    },
    {
      "name": "phone",
      "dataType": "STRING"
    },
    {
      "name": "epoch",
      "dataType": "LONG"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "surge",
      "dataType": "DOUBLE"
    },
    {
      "name": "subTotal",
      "dataType": "DOUBLE"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "dateString",
      "dataType": "STRING",
      "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd-HH",
      "granularity": "1:DAYS"
    }
  ]
}
table config:
{
  "tableName": "svd",
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "dateString",
        "transformFunction": "toDateTime(epoch, 'yyyy-MM-dd-HH')"
      }
    ]
  },
  "segmentsConfig": {
    "timeColumnName": "dateString",
    "timeType": "MILLISECONDS",
    "replication": "1",
    "schemaName": "svd"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [
      "serviceId"
    ],
    "loadMode": "MMAP",
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "currentCity": {
          "functionName": "Murmur",
          "numPartitions": 4
        }
      }
    }
  },
  "routing": {
    "segmentPrunerTypes": [
      "partition"
    ]
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableType": "OFFLINE",
  "metadata": {}
}
o
For offline tables, you have to control the number of rows in each input file yourself (each file is converted to a segment later). Pinot just converts an input file to a segment, so one file equals one segment. For your realtime tables, you can check the configurations at https://docs.pinot.apache.org/basics/data-import/pinot-stream-ingestion
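Since one input file becomes one segment, the segment count per hour follows from how the input is split before ingestion. Here is a minimal sketch of that idea, not a Pinot API: it assumes `epoch` is in milliseconds, and the helper names (`hour_bucket`, `plan_input_files`) and the file naming pattern are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(epoch_millis):
    """Format an epoch-millisecond timestamp as yyyy-MM-dd-HH in UTC."""
    ts = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d-%H")


def plan_input_files(rows, max_rows_per_file):
    """Group rows by hour, then cap each planned file at max_rows_per_file rows.

    Returns a dict mapping a hypothetical file name to its list of rows.
    Each planned file would become one segment after ingestion.
    """
    by_hour = defaultdict(list)
    for row in rows:
        # Assumption: the "epoch" field holds milliseconds since 1970-01-01 UTC.
        by_hour[hour_bucket(row["epoch"])].append(row)

    files = {}
    for hour, hour_rows in sorted(by_hour.items()):
        chunk_starts = range(0, len(hour_rows), max_rows_per_file)
        for part, start in enumerate(chunk_starts):
            name = f"svd_{hour}_{part}.json"  # hypothetical naming pattern
            files[name] = hour_rows[start:start + max_rows_per_file]
    return files
```

With a row cap larger than the busiest hour, every hour ends up in exactly one file and therefore one segment. A smaller cap produces several files, and so several segments, for the same hour.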