Hi, I am running a query which involves AND , OR a...
# general
s
Hi, I am running a query which involves AND , OR and with some filters on string and long values. It has basically 34Million rows , and querying(selecting few columns for an ID) takes almost 2 sec and numEntriesScannedInFilter(89Million) & numEntriesScannedPostFilter are bigger values. Can someone help me to understand, how come this many entries scanned in filter, where i am using Inverted index...?
m
Are there any ORs that can be changed into INs? If so try that
s
yeah we can do that
after changing it from OR to IN wherever possible giving same time taken, but record scanned is more this time
m
Can you paste the new query and the records scanned?
and also all columns that have sorted/inv index?
s
Query: select AUDITLOGID,RECORDID,RECORDNAME,AUDITEDTIME,USERID,ACTIONTYPE,SOURCE,ACTIONINFO,OTHERDETAILS,DONEBY FROM auditlog WHERE ((((((RELATEDID = 553493000165096765) AND (RELATEDMODULE = 'Contacts')) AND (AUDITEDTIME >= 1588214155000)) AND ((((((((MODULE = 'Potentials') AND ((ACTIONTYPE = 19) OR (ACTIONTYPE = 20))) OR ((MODULE = 'Potentials') AND (((((OTHERDETAILS = 'Unattended Dialled') OR (OTHERDETAILS = 'Attended Dialled')) OR (OTHERDETAILS = 'Scheduled Attended')) OR (OTHERDETAILS = 'Received')) OR (OTHERDETAILS = 'Missed')))) OR (ACTIONTYPE IN (36,34,19))) OR ((ACTIONTYPE IN (10,11,1)) AND (SOURCE = 19))) OR ((ACTIONTYPE IN (10,11,19)) AND (SOURCE = 10))) OR ((ACTIONTYPE = 1) AND (MODULE = 'Potentials'))) OR (ACTIONTYPE = 69))))) limit 5000 table config: { "OFFLINE": { "tableName": "auditlog_OFFLINE", "tableType": "OFFLINE", "segmentsConfig": { "replication": "1", "segmentPushType": "REFRESH" }, "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" }, "tableIndexConfig": { "invertedIndexColumns": [ "RELATEDID", "AUDITLOGID" ], "autoGeneratedInvertedIndex": false, "createInvertedIndexDuringSegmentGeneration": false, "loadMode": "MMAP", "enableDefaultStarTree": false, "enableDynamicStarTreeCreation": false, "aggregateMetrics": false, "nullHandlingEnabled": false }, "metadata": { "customConfigs": {} }, "ingestionConfig": {}, "isDimTable": false } }
m
You didn’t change the otherDetails to IN?
s
totalDocs : 34Milion num ofsegments: 1 num of servers: 1 numEntriesScannedInFilter : 89627266
otherDetails is basically a json string
i will change that too
i changed otherDetails too, but giving same result /timetaken and stats
my main concern is, how come numEntriesScannedInFilter is 89Million which is very high in number
m
You have inv index only on 2 columns?
s
yes
but when i checked metadata.properties, it shows almost all columns have invertedindex
m
Yeah that is misleading. It is only for the two columns you specified
s
ohh
shall i specify the columns which i used to filter in invertedindex?
ACTIONTYPE, OTHERDETAILS, MODULE, SOURCE
?
m
Which ones of them have high cardinality?
s
unique values will be more in this in case of OTHERDETAILS column
m
Ok but that is json?
As in configured as json for Pinot or Pinot thinks it is just a string?
s
its string
toString()
m
Ok, let’s start with that
And if that data size is not that much, may be set for all
s
what do u mean by data size?
m
Total data size in Pinot
s
Screenshot from 2021-05-10 21-04-27.png
segment tar.gz size is 639M
i applied inv index , now timetaken to finish the query took 393ms(earlier it was almost 2sec), but numOfEntriesScanned is 2140274...
can i reduce numOfEntriesScanned it some more
m
Is this with inv index on all columns?
Also I do not see a sorted column
s
Yes with 12 columns inv index and no sorted column
Will apply pk as sorted index and check if entries scanned can reduced
m
Yeah, sorted column will help reduce further.
Will the pk be part of most queries? And what kind of predicate would you have on that pk in the queries?
s
Yes pk will be part of every query .
can we apply sorted and inv both index on PK?
m
Sorted is good enough, you don’t need inv
s
ok, i applied it, but i see no much difference in timetaken and nmofentriesscannedinfilter
m
What’s the predicate on sorted column?
Also can you confirm from metadata.properties that it is actually sorted?
s
column.RELATEDID.isSorted = false
how come?
i configured it as sortedColumn in tableconfig
m
Is this offline or real-time
For offline you need to sort outside of Pinot
For realtime, Pinot will sort
s
offline
m
Ok, then ensure that the input data from which Pinot segment is generated is sorted
s
input is orc file, so i just ingested
m
Will need to write a job to sort input
We are working on making this inside of Pinot using Minion, but not ready yet
For now just try inv index in the column first. We you already had that, then it won’t help much with latency, might help with throughput if sorted
s
ok
when can we expect to release minion