Hello -- I created a default real time table . Aft...
# general
a
Hello -- I created a default real time table . After consuming some 300k events i wanted to add a sorted inverted so i edited the table to add the sorted col. How can I check if my query is using the index and whether the index is successfully created or not.?
m
For segments flushed to disk you can check metadata.properties file. If the column is marked as sorted, you can assume sorted index will be used.
a
Hi Mayank -- this is the metadata from one of the latest segments, but i dont see any sorted column here..
Copy code
{
  "segment.realtime.endOffset": "1209967",
  "segment.start.time": "1621347175827",
  "segment.time.unit": "MILLISECONDS",
  "segment.flush.threshold.size": "50000",
  "segment.realtime.startOffset": "1159967",
  "segment.end.time": "1621347449397",
  "segment.total.docs": "50000",
  "segment.table.name": "schd_1",
  "segment.realtime.numReplicas": "1",
  "segment.creation.time": "1621347177149",
  "segment.realtime.download.url": "<http://172.18.0.2:9000/segments/schd_1/schd_1__0__34__20210518T1412Z>",
  "segment.name": "schd_1__0__34__20210518T1412Z",
  "segment.index.version": "v3",
  "segment.flush.threshold.time": null,
  "segment.type": "REALTIME",
  "segment.crc": "2855665894",
  "segment.realtime.status": "DONE"
}
how do I ensure the sorted index is created and put to use ?
Copy code
{
  "REALTIME": {
    "tableName": "schd_1_REALTIME",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "timeType": "MILLISECONDS",
      "schemaName": "schd",
      "timeColumnName": "upd_ts",
      "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
      "segmentPushType": "APPEND",
      "replicasPerPartition": "1"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "loadMode": "MMAP",
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.consumer.type": "lowLevel",
        "stream.kafka.topic.name": "schd",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "stream.kafka.hlc.zk.connect.string": "localhost:2191/kafka",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.zk.broker.url": "localhost:2191/kafka",
        "stream.kafka.broker.list": "localhost:19092",
        "realtime.segment.flush.threshold.rows": "50000",
        "realtime.segment.flush.threshold.time": "10m"
      },
      "enableDefaultStarTree": false,
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "sortedColumn": [
        "post_prd_id"
      ],
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false
    },
    "metadata": {
      "customConfigs": {}
    },
    "routing": {
      "instanceSelectorType": "strictReplicaGroup"
    },
    "upsertConfig": {
      "mode": "FULL"
    },
    "isDimTable": false
  }
}
this is my table config
n
This metadata is from zookeeper. You need to check the metadata.properties file, which you will find on the server, inside each segment dir
m
Yes ^^
n
And for older completed segments, any indexing change in table config will only reflect after a segment reload API invocation. However, I think sorted index cannot be applied this way to old segments
a
where can i find the location of these files ? I am using the docker image.
m
The data dir of the server
a
got it
Copy code
column.post_prd_id.columnType = DIMENSION
column.post_prd_id.isSorted = true
m
Yeah, if column is sorted then You can assume it is being used
a
got it, i assume there is no way to get the query plan since its dynamic for every segment, right?
m
Yeah, right now there isn’t a way to get query plan, mostly because pinot doesn’t support complex joins or nested queries
a
got it.. makes sense..
Copy code
timeUsedMs	numDocsScanned	totalDocs	numServersQueried	numServersResponded	numSegmentsQueried	numSegmentsProcessed	numSegmentsMatched	numConsumingSegmentsQueried	numEntriesScannedInFilter	numEntriesScannedPostFilter	numGroupsLimitReached	partialResponse	minConsumingFreshnessTimeMs	offlineThreadCpuTimeNs	realtimeThreadCpuTimeNs
40	31741	1584379	1	1	43	43	34	1	928729	95223	false	-	1621350158012	0	0
Where can I read about what each of these mean ?
m
one sec
The search in docs.pinot.apache.org is pretty good, and should be able to point you to any docs related to terms you query.
a
got it.. sorry about that.. i was wondering what is the diff between Docs and Entries ?
docs is the actual record, but what does numEntries mean?
m
Doc represents a record
Entry represents a value for a column in the record.
a
ok.. so if i see that numEntriesScannedInFilter is really high for a low cardinality col filter, would that mean it's better to have an inverted index on that?
m
If the cardinality is very low (say gender - M/F/U), then adding inv index only prunes out 2/3 or the data. Depending on your case and query latency requirement, it might still be a good idea.
a
right.. understood..