Hi I am running a query which involves AND OR and with some Apache Pinot #general

Syed Akram

05/10/2021, 7:18 AM

Hi, I am running a query that involves AND, OR, and some filters on string and long values. The table has about 34 million rows, and selecting a few columns for one ID takes almost 2 seconds; numEntriesScannedInFilter (89 million) and numEntriesScannedPostFilter are both large. Can someone help me understand why so many entries are scanned in the filter when I am using an inverted index?

Mayank

05/10/2021, 1:53 PM

Are there any ORs that can be changed into INs? If so, try that.

Syed Akram

05/10/2021, 2:23 PM

yeah we can do that

Syed Akram

05/10/2021, 2:28 PM

After changing OR to IN wherever possible, the time taken is the same, but more records are scanned this time.

Mayank

05/10/2021, 2:44 PM

Can you paste the new query and the records scanned?

Mayank

05/10/2021, 2:44 PM

and also all columns that have sorted/inv index?

Syed Akram

05/10/2021, 3:04 PM

Query:
select AUDITLOGID,RECORDID,RECORDNAME,AUDITEDTIME,USERID,ACTIONTYPE,SOURCE,ACTIONINFO,OTHERDETAILS,DONEBY
FROM auditlog
WHERE ((((((RELATEDID = 553493000165096765) AND (RELATEDMODULE = 'Contacts')) AND (AUDITEDTIME >= 1588214155000)) AND ((((((((MODULE = 'Potentials') AND ((ACTIONTYPE = 19) OR (ACTIONTYPE = 20))) OR ((MODULE = 'Potentials') AND (((((OTHERDETAILS = 'Unattended Dialled') OR (OTHERDETAILS = 'Attended Dialled')) OR (OTHERDETAILS = 'Scheduled Attended')) OR (OTHERDETAILS = 'Received')) OR (OTHERDETAILS = 'Missed')))) OR (ACTIONTYPE IN (36,34,19))) OR ((ACTIONTYPE IN (10,11,1)) AND (SOURCE = 19))) OR ((ACTIONTYPE IN (10,11,19)) AND (SOURCE = 10))) OR ((ACTIONTYPE = 1) AND (MODULE = 'Potentials'))) OR (ACTIONTYPE = 69)))))
limit 5000
Table config:
{
  "OFFLINE": {
    "tableName": "auditlog_OFFLINE",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "replication": "1",
      "segmentPushType": "REFRESH"
    },
    "tenants": {
      "broker": "DefaultTenant",
      "server": "DefaultTenant"
    },
    "tableIndexConfig": {
      "invertedIndexColumns": [
        "RELATEDID",
        "AUDITLOGID"
      ],
      "autoGeneratedInvertedIndex": false,
      "createInvertedIndexDuringSegmentGeneration": false,
      "loadMode": "MMAP",
      "enableDefaultStarTree": false,
      "enableDynamicStarTreeCreation": false,
      "aggregateMetrics": false,
      "nullHandlingEnabled": false
    },
    "metadata": {
      "customConfigs": {}
    },
    "ingestionConfig": {},
    "isDimTable": false
  }
}

Mayank

05/10/2021, 3:05 PM

You didn’t change the otherDetails to IN?
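The OR-to-IN rewrite suggested here can be sketched with a small helper. This is only an illustration: `or_to_in` is a hypothetical name, not part of Pinot, and the values are the ones from the query posted in this thread.

```python
def or_to_in(column, values):
    """Collapse OR-ed equality predicates on one column into a single IN clause."""
    def literal(value):
        # Quote strings (doubling embedded single quotes); leave numbers bare.
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    return f"{column} IN ({', '.join(literal(v) for v in values)})"


# (ACTIONTYPE = 19) OR (ACTIONTYPE = 20)
print(or_to_in("ACTIONTYPE", [19, 20]))

# The five OR-ed OTHERDETAILS comparisons from the query
print(or_to_in("OTHERDETAILS", [
    "Unattended Dialled",
    "Attended Dialled",
    "Scheduled Attended",
    "Received",
    "Missed",
]))
```

The rewrite changes only the shape of the predicate, not which rows match, so on its own it would not be expected to change the result set.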

Syed Akram

05/10/2021, 3:05 PM

totalDocs: 34 million, number of segments: 1, number of servers: 1, numEntriesScannedInFilter: 89627266

Syed Akram

05/10/2021, 3:05 PM

otherDetails is basically a JSON string

Syed Akram

05/10/2021, 3:07 PM

I will change that too

Syed Akram

05/10/2021, 3:11 PM

I changed otherDetails too, but the result, time taken, and stats are the same

Syed Akram

05/10/2021, 3:12 PM

My main concern is: how come numEntriesScannedInFilter is 89 million? That is a very high number.
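One plausible reading of a filter count larger than the row count is that several predicates on columns without an index each read the column once. The toy model below illustrates that idea only. It is a deliberate simplification, not Pinot's actual planner, and `entries_scanned` is a hypothetical helper, so the numbers it prints are not expected to match the 89,627,266 reported in the thread.

```python
def entries_scanned(num_docs, or_predicate_columns, indexed_columns):
    """Toy estimate: each OR-ed predicate on a column without an index reads
    one entry per document; predicates on indexed columns read none."""
    unindexed = [c for c in or_predicate_columns if c not in indexed_columns]
    return num_docs * len(unindexed)


NUM_DOCS = 34_000_000  # approximate row count quoted in the thread
columns_in_or = ["MODULE", "ACTIONTYPE", "OTHERDETAILS", "SOURCE"]

# Only the two columns from the posted table config are indexed.
before = entries_scanned(NUM_DOCS, columns_in_or, {"RELATEDID", "AUDITLOGID"})

# Every filter column is indexed.
after = entries_scanned(NUM_DOCS, columns_in_or, {"RELATEDID", "AUDITLOGID", *columns_in_or})

print(before, after)
```

Under these assumptions the scanned count is a multiple of the row count until the filter columns get an index, which is the direction the rest of the thread takes.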

Mayank

05/10/2021, 3:22 PM

You have inv index only on 2 columns?

Syed Akram

05/10/2021, 3:24 PM

yes

Syed Akram

05/10/2021, 3:24 PM

But when I checked metadata.properties, it shows that almost all columns have an inverted index

Mayank

05/10/2021, 3:25 PM

Yeah that is misleading. It is only for the two columns you specified

Syed Akram

05/10/2021, 3:25 PM

ohh

Syed Akram

05/10/2021, 3:26 PM

Shall I add the columns I filter on to invertedIndexColumns?

Syed Akram

05/10/2021, 3:27 PM

ACTIONTYPE, OTHERDETAILS, MODULE, SOURCE
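A minimal sketch of that config change, assuming the `tableIndexConfig` section from the table config posted earlier (trimmed here to two keys). `add_inverted_index_columns` is a hypothetical helper for illustration, not a Pinot API.

```python
import json

# Index section copied from the posted table config, trimmed to the relevant keys.
table_index_config = {
    "invertedIndexColumns": ["RELATEDID", "AUDITLOGID"],
    "loadMode": "MMAP",
}


def add_inverted_index_columns(index_config, columns):
    """Return a copy of index_config with columns appended to
    invertedIndexColumns, skipping any that are already listed."""
    updated = dict(index_config)
    existing = list(index_config.get("invertedIndexColumns", []))
    updated["invertedIndexColumns"] = existing + [c for c in columns if c not in existing]
    return updated


new_config = add_inverted_index_columns(
    table_index_config, ["ACTIONTYPE", "OTHERDETAILS", "MODULE", "SOURCE"]
)
print(json.dumps(new_config, indent=2))
```

The printed JSON would still have to be merged back into the full table config and submitted to Pinot, and existing segments will likely need a reload before the new indexes take effect.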

Mayank

05/10/2021, 3:27 PM

Which ones of them have high cardinality?

Syed Akram

05/10/2021, 3:28 PM

The OTHERDETAILS column will have the most unique values

Mayank

05/10/2021, 3:31 PM

Ok, but that is JSON?

Mayank

05/10/2021, 3:31 PM

As in configured as json for Pinot or Pinot thinks it is just a string?

Syed Akram

05/10/2021, 3:31 PM

It's a string

Syed Akram

05/10/2021, 3:31 PM

toString()

Mayank

05/10/2021, 3:32 PM

Ok, let’s start with that

Mayank

05/10/2021, 3:32 PM

And if the data size is not that large, maybe set it for all columns

Syed Akram

05/10/2021, 3:33 PM

What do you mean by data size?

Mayank

05/10/2021, 3:33 PM

Total data size in Pinot

Syed Akram

05/10/2021, 3:34 PM

Screenshot from 2021-05-10 21-04-27.png

Syed Akram

05/10/2021, 3:35 PM

The segment tar.gz size is 639M

Syed Akram

05/10/2021, 4:03 PM

I applied the inv index; the query now takes 393 ms (earlier it was almost 2 sec), but the number of entries scanned is 2140274...

Syed Akram

05/10/2021, 4:04 PM

Can I reduce the number of entries scanned some more?

Mayank

05/10/2021, 4:13 PM

Is this with inv index on all columns?

Mayank

05/10/2021, 4:14 PM

Also I do not see a sorted column

Syed Akram

05/10/2021, 4:52 PM

Yes, with inv index on 12 columns and no sorted column

Syed Akram

05/10/2021, 4:53 PM

Will apply the PK as the sorted column and check whether the entries scanned can be reduced

Mayank

05/10/2021, 5:24 PM

Yeah, sorted column will help reduce further.

Mayank

05/10/2021, 5:24 PM

Will the pk be part of most queries? And what kind of predicate would you have on that pk in the queries?

Syed Akram

05/11/2021, 6:16 AM

Yes, the PK will be part of every query.

Syed Akram

05/11/2021, 6:22 AM

Can we apply both a sorted and an inv index on the PK?

Mayank

05/11/2021, 3:05 PM

Sorted is good enough, you don’t need inv

Syed Akram

05/11/2021, 3:06 PM

OK, I applied it, but I don't see much difference in time taken or numEntriesScannedInFilter

Mayank

05/11/2021, 3:07 PM

What’s the predicate on sorted column?

Mayank

05/11/2021, 3:08 PM

Also can you confirm from metadata.properties that it is actually sorted?

Syed Akram

05/11/2021, 3:09 PM

column.RELATEDID.isSorted = false
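The same check can be run over every column in a segment's metadata.properties. The sketch below assumes keys of the form `column.<name>.isSorted`, as in the line quoted above. `sorted_flags` is a hypothetical helper, and the `cardinality` line in the sample is made up purely to show that other keys are ignored.

```python
def sorted_flags(properties_text):
    """Map column name -> bool from lines like 'column.RELATEDID.isSorted = false'."""
    flags = {}
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        parts = key.strip().split(".")
        # Keep only 'column.<name>.isSorted' keys; column names containing dots are not handled.
        if sep and len(parts) == 3 and parts[0] == "column" and parts[2] == "isSorted":
            flags[parts[1]] = value.strip().lower() == "true"
    return flags


sample = (
    "column.RELATEDID.isSorted = false\n"
    "column.RELATEDID.cardinality = 42\n"  # made-up line, ignored by the parser
)
print(sorted_flags(sample))
```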

Syed Akram

05/11/2021, 3:09 PM

How come?

Syed Akram

05/11/2021, 3:10 PM

I configured it as sortedColumn in the table config

Mayank

05/11/2021, 3:10 PM

Is this offline or real-time?

Mayank

05/11/2021, 3:10 PM

For offline you need to sort outside of Pinot

Mayank

05/11/2021, 3:10 PM

For realtime, Pinot will sort

Syed Akram

05/11/2021, 3:11 PM

offline

Mayank

05/11/2021, 3:11 PM

Ok, then ensure that the input data from which Pinot segment is generated is sorted

Syed Akram

05/11/2021, 3:11 PM

The input is an ORC file, so I just ingested it

Mayank

05/11/2021, 3:11 PM

Will need to write a job to sort input
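A minimal sketch of such a sort step, using in-memory records so that it runs on its own. Reading and writing ORC is left out; a real job would load the file with whatever tool already produces it and write the sorted rows back before segment generation. The record values and both helper names are hypothetical.

```python
from operator import itemgetter


def sort_for_segment(records, sorted_column):
    """Return the records ordered by the column configured as sortedColumn."""
    return sorted(records, key=itemgetter(sorted_column))


def is_sorted(records, column):
    """Check that the column values never decrease from one row to the next."""
    values = [record[column] for record in records]
    return all(a <= b for a, b in zip(values, values[1:]))


# Made-up rows standing in for the contents of the input file.
rows = [
    {"RELATEDID": 30, "ACTIONTYPE": 19},
    {"RELATEDID": 10, "ACTIONTYPE": 1},
    {"RELATEDID": 20, "ACTIONTYPE": 69},
]

ordered = sort_for_segment(rows, "RELATEDID")
print(is_sorted(rows, "RELATEDID"), is_sorted(ordered, "RELATEDID"))
```

For an input of tens of millions of rows, an external or distributed sort would likely be needed instead of sorting in memory.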

Mayank

05/11/2021, 3:12 PM

We are working on making this inside of Pinot using Minion, but not ready yet

Mayank

05/11/2021, 3:14 PM

For now, just try the inv index on the column first. If you already had that, then sorting won’t help much with latency, though it might help with throughput.

Syed Akram

05/11/2021, 3:32 PM

When can we expect the Minion-based sorting to be released?
