what's the numDocsScanned? Apache Pinot #troubleshooting


# troubleshooting

Mayank

08/20/2020, 10:13 PM

what's the numDocsScanned?

Buchi Reddy

08/20/2020, 10:14 PM


{
  "resultTable": {
    "dataSchema": {
      "columnDataTypes": [
        "INT"
      ],
      "columnNames": [
        "distinctcount(service_id)"
      ]
    },
    "rows": [
      [
        2
      ]
    ]
  },
  "exceptions": [],
  "numServersQueried": 4,
  "numServersResponded": 4,
  "numSegmentsQueried": 180,
  "numSegmentsProcessed": 180,
  "numSegmentsMatched": 179,
  "numConsumingSegmentsQueried": 8,
  "numDocsScanned": 127901110,
  "numEntriesScannedInFilter": 131526370,
  "numEntriesScannedPostFilter": 127901110,
  "numGroupsLimitReached": false,
  "totalDocs": 128941076,
  "timeUsedMs": 3339,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 1597961648907
}

Buchi Reddy

08/20/2020, 10:14 PM

127M, which is for sure high 🙂
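To put the "127M" figure in context, here is a minimal Python sketch that reads the relevant fields from the response above and compares numDocsScanned with totalDocs. The helper name scan_summary and its output keys are made up for illustration; only the field names and values come from the pasted response. It suggests the query touched roughly 99% of the documents in the table.

```python
import json

# Subset of the broker response pasted above (values copied verbatim).
response = json.loads("""
{
  "numSegmentsQueried": 180,
  "numSegmentsMatched": 179,
  "numDocsScanned": 127901110,
  "numEntriesScannedInFilter": 131526370,
  "numEntriesScannedPostFilter": 127901110,
  "totalDocs": 128941076,
  "timeUsedMs": 3339
}
""")


def scan_summary(stats):
    """Hypothetical helper: derive a few ratios from the response stats."""
    return {
        # Share of all documents in the table that the query scanned.
        "scanned_fraction": stats["numDocsScanned"] / stats["totalDocs"],
        # Share of queried segments that had at least one matching doc.
        "segments_matched_fraction": (
            stats["numSegmentsMatched"] / stats["numSegmentsQueried"]
        ),
        # Rough throughput: documents scanned per millisecond of query time.
        "docs_per_ms": stats["numDocsScanned"] / stats["timeUsedMs"],
    }


summary = scan_summary(response)
print(f"{summary['scanned_fraction']:.1%} of docs scanned")
print(f"{summary['segments_matched_fraction']:.1%} of segments matched")
```

A scanned fraction this close to 1 means the filter is barely pruning anything, which is why the follow-up questions focus on partitioning and pre-aggregation.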

Mayank

08/20/2020, 10:15 PM

Yep

Buchi Reddy

08/20/2020, 10:20 PM

So, there is nothing we can do? Looks like we should not use Pinot for such queries.

Mayank

08/20/2020, 10:21 PM

Can your data be partitioned?

Mayank

08/20/2020, 10:22 PM

I am not aware of any optimizations specific to distinct/hll. However, you can check if there are other optimizations like partitioning that you are using already.

Mayank

08/20/2020, 10:22 PM

What other tools do you have in mind for such queries, apart from Pinot?

Buchi Reddy

08/20/2020, 10:26 PM

I think it’s use case dependent. In this particular query, we have data in MongoDB as well and it’s just about getting doc count with a filter. MongoDB is updating the same record, hence it’s just count, not distinct count there.

Mayank

08/20/2020, 10:27 PM

Hmm, aggregate-metrics does that in Pinot too, right?

Mayank

08/20/2020, 10:27 PM

If we see the same dimension values, then we increment the metric columns

Mayank

08/20/2020, 10:28 PM

Although, that only gives distinct per segment

Mayank

08/20/2020, 10:28 PM

Yeah, so you will get higher counts (one per segment)

Mayank

08/20/2020, 10:31 PM

My expectation is that if the same record is being overwritten in MongoDB, then the overall numDocsScanned should be 1 per segment (for a unique combination of dimensions), in which case Pinot will also become fast when using aggregate-metrics. However, if you still see 120M records in MongoDB, it might also have to scan the same amount of data.
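The per-segment behavior described here can be sketched in plain Python. This is only a simulation of the idea (sum the metrics of rows that share the same dimension values within one segment), not Pinot's implementation; the function rollup_segment, the sample rows, and the count metric are hypothetical, and service_id is borrowed from the query above.

```python
from collections import defaultdict


def rollup_segment(rows, dimension_names, metric_names):
    """Sum metrics for rows that share the same dimension values (one segment)."""
    totals = defaultdict(lambda: [0] * len(metric_names))
    for row in rows:
        key = tuple(row[d] for d in dimension_names)
        for i, metric in enumerate(metric_names):
            totals[key][i] += row[metric]
    # Rebuild one row per unique dimension combination.
    return [
        {**dict(zip(dimension_names, key)), **dict(zip(metric_names, sums))}
        for key, sums in totals.items()
    ]


# Hypothetical events landing in two different segments.
segment_a = [
    {"service_id": "s1", "count": 1},
    {"service_id": "s1", "count": 1},
    {"service_id": "s2", "count": 1},
]
segment_b = [
    {"service_id": "s1", "count": 1},
    {"service_id": "s1", "count": 1},
]

rolled = [rollup_segment(seg, ["service_id"], ["count"]) for seg in (segment_a, segment_b)]

docs_before = len(segment_a) + len(segment_b)
docs_after = sum(len(seg) for seg in rolled)
distinct_services = {row["service_id"] for seg in rolled for row in seg}

print(docs_before, docs_after, len(distinct_services))
```

In this toy data, s1 still appears once in each segment after the rollup, so a plain row count would over-count it (one per segment), while the number of rows a distinct-count query has to scan drops.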

Buchi Reddy

08/20/2020, 10:43 PM

what’s aggregate-metrics?

Mayank

08/20/2020, 10:57 PM

It's a table config that allows metric columns to be aggregated (summed) for each unique combination of dimension columns. I was trying to add this to the docs and was having trouble a couple of weeks ago. I thought @Kishore G added it?


In the following example, we enable aggregateMetrics by setting it to true in tableIndexConfig. Note that all the metrics (count in this case) have to be noDictionaryColumns. Also note that even though the dimension country is defined as a noDictionaryColumn, the aggregateMetrics setting will take precedence and the dimension country will use dictionary-based indexing.

"tableIndexConfig": {
  "aggregateMetrics": true,
  "noDictionaryColumns": [
    "country",
    "count"
  ],
  "streamConfigs": {
  ...
}
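The rule quoted above (every metric must be listed in noDictionaryColumns when aggregateMetrics is on) is easy to check before pushing a table config. The sketch below is a hypothetical pre-flight check in Python, not part of Pinot or its tooling; the function name is invented, and the elided streamConfigs section is simply left out of the sample dictionary.

```python
def metrics_missing_no_dictionary(table_index_config, metric_columns):
    """Hypothetical check: list metrics not declared in noDictionaryColumns.

    Returns an empty list when aggregateMetrics is disabled, since the rule
    quoted above only applies when it is enabled.
    """
    if not table_index_config.get("aggregateMetrics", False):
        return []
    no_dictionary = set(table_index_config.get("noDictionaryColumns", []))
    return [m for m in metric_columns if m not in no_dictionary]


# Mirrors the fragment above; streamConfigs is elided there and omitted here.
table_index_config = {
    "aggregateMetrics": True,
    "noDictionaryColumns": ["country", "count"],
}

missing = metrics_missing_no_dictionary(table_index_config, ["count"])
print("metrics missing from noDictionaryColumns:", missing)
```

The list of metric columns is passed in explicitly because it lives in the table schema rather than in tableIndexConfig.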

Kishore G

08/21/2020, 2:13 AM

Are you trying to get distinct services for a given time period?
