what's the numDocsScanned?
# troubleshooting
m
what's the numDocsScanned?
b
{
  "resultTable": {
    "dataSchema": {
      "columnDataTypes": [
        "INT"
      ],
      "columnNames": [
        "distinctcount(service_id)"
      ]
    },
    "rows": [
      [
        2
      ]
    ]
  },
  "exceptions": [],
  "numServersQueried": 4,
  "numServersResponded": 4,
  "numSegmentsQueried": 180,
  "numSegmentsProcessed": 180,
  "numSegmentsMatched": 179,
  "numConsumingSegmentsQueried": 8,
  "numDocsScanned": 127901110,
  "numEntriesScannedInFilter": 131526370,
  "numEntriesScannedPostFilter": 127901110,
  "numGroupsLimitReached": false,
  "totalDocs": 128941076,
  "timeUsedMs": 3339,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 1597961648907
}
127M, which is for sure high 🙂
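To put the 127M figure in context, here is a minimal Python sketch that reads the stats from the response above and compares `numDocsScanned` against `totalDocs`. The `scan_summary` helper and the names of its output keys are made up for illustration and are not part of any Pinot client; only the field names and values come from the response.

```python
import json

# A subset of the stats, copied from the broker response quoted above.
response_json = """
{
  "numSegmentsQueried": 180,
  "numSegmentsMatched": 179,
  "numDocsScanned": 127901110,
  "numEntriesScannedInFilter": 131526370,
  "numEntriesScannedPostFilter": 127901110,
  "totalDocs": 128941076,
  "timeUsedMs": 3339
}
"""


def scan_summary(stats):
    """Hypothetical helper: derive a few ratios from broker response stats."""
    return {
        # Share of all documents in the table that the query had to scan.
        "scanned_fraction": stats["numDocsScanned"] / stats["totalDocs"],
        # Share of queried segments with at least one matching document.
        "segments_matched_fraction": (
            stats["numSegmentsMatched"] / stats["numSegmentsQueried"]
        ),
        # Rough throughput: documents scanned per millisecond of query time.
        "docs_per_ms": stats["numDocsScanned"] / stats["timeUsedMs"],
    }


summary = scan_summary(json.loads(response_json))
print(f"scanned {summary['scanned_fraction']:.1%} of all docs")
print(f"matched {summary['segments_matched_fraction']:.1%} of segments")
```

With these numbers, roughly 99% of the table is scanned, so the filter barely narrows the data before the distinct count runs.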
m
Yep
b
So, there is nothing we can do? Looks like we should not use Pinot for such queries.
m
Can your data be partitioned?
I am not aware of any optimizations specific to distinct/hll. However, you can check if there are other optimizations like partitioning that you are using already.
What other tools do you have in mind for such queries, apart from Pinot?
b
I think it’s use case dependent. In this particular query, we have data in MongoDB as well and it’s just about getting doc count with a filter. MongoDB is updating the same record, hence it’s just count, not distinct count there.
m
Hmm, aggregate-metrics does that in Pinot too, right?
If we see same dimension values, then we increment metric columns
Although, that only gives distinct per segment
Yeah, so you will get higher counts (one per segment)
My expectation is that if the same record is being overwritten in MongoDB, then the overall numDocsScanned should be 1 per segment (for a unique combination of dimensions), in which case Pinot will also become fast when using aggregate-metrics. However, if you still see 120M records in MongoDB, it might also have to scan the same amount of data.
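The "one per segment" point can be illustrated with a toy Python model. This is only a sketch of the idea described in the messages above (sum the metric for each unique dimension combination within a segment), not Pinot's actual implementation; the segments, service IDs, and the `rollup_segment` helper are all hypothetical.

```python
from collections import defaultdict


def rollup_segment(rows):
    """Sum the metric for each unique dimension combination in one segment."""
    totals = defaultdict(int)
    for dims, count in rows:
        totals[dims] += count
    return dict(totals)


# Hypothetical rows: (dimension tuple, count metric).
segment_a = [(("svc-1",), 1), (("svc-1",), 1), (("svc-2",), 1)]
segment_b = [(("svc-1",), 1), (("svc-2",), 1), (("svc-2",), 1)]

rolled = [rollup_segment(seg) for seg in (segment_a, segment_b)]

docs_before = len(segment_a) + len(segment_b)
docs_after = sum(len(seg) for seg in rolled)
distinct_services = len({dims for seg in rolled for dims in seg})

print(docs_before, docs_after, distinct_services)
```

In this toy data the six raw rows collapse to four (two per segment), while the true number of distinct services is two. Counting rolled-up rows therefore still overcounts across segments, but a distinct count has far fewer documents to scan.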
b
What’s aggregate-metrics?
m
It is a table config that allows metric columns to be aggregated (summed) for each unique combination of dimension columns. I was trying to add this to the docs and was having trouble a couple of weeks ago. I thought @Kishore G added it?
In the following example, we enable aggregateMetrics by setting it to true in tableIndexConfig. Note that all the metrics (count in this case) have to be noDictionaryColumns. Also note that even though the dimension country is defined as a noDictionaryColumn, the aggregateMetrics setting will take precedence and the dimension country will use dictionary-based indexing.
"tableIndexConfig": {
  "aggregateMetrics": true,
  "noDictionaryColumns": [
    "country",
    "count"
  ],
  "streamConfigs": {
    ...
  }
}
k
Are you trying to get distinct services for a given time period?