Ken Krugler
09/28/2022, 3:25 PMSELECT sum(metric) AS sumMetric, key
FROM table
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173
AND dim3 NOT IN ('yy', 'zz')
GROUP BY key ORDER BY sumMetric DESC LIMIT 3
Ken Krugler
09/28/2022, 3:30 PM1.7132548232917935E7 key3
but now it’s
1.5662814895781398E7 key4
However doing an explicit query on key3 with:
SELECT sum(metric) AS sumMetric, key
FROM table
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173
AND dim3 NOT IN ('yy', 'zz')
AND key = 'key3'
returns the previous sum for key3 of 1.7132548232917935E7
, so it still should be the third result.
This behavior is the same regardless of whether we add OPTION(segmentMinTrimSize=1000)
.Ken Krugler
09/28/2022, 3:33 PMKen Krugler
09/28/2022, 3:34 PMKen Krugler
09/28/2022, 4:35 PMRong R
09/28/2022, 4:43 PMRong R
09/28/2022, 4:43 PM1.5662814895781398E7 key4
you really meant key3
yes?Ken Krugler
09/28/2022, 4:44 PMKen Krugler
09/28/2022, 4:44 PMKen Krugler
09/28/2022, 4:45 PMRong R
09/28/2022, 4:45 PMKen Krugler
09/28/2022, 4:45 PMKen Krugler
09/28/2022, 4:45 PMRong R
09/28/2022, 4:45 PMKen Krugler
09/28/2022, 4:45 PMRong R
09/28/2022, 4:46 PMKen Krugler
09/28/2022, 4:46 PMKen Krugler
09/28/2022, 4:46 PMRong R
09/28/2022, 4:46 PMsum(CAST(metrics AS DOUBLE)
Ken Krugler
09/28/2022, 4:47 PMRong R
09/28/2022, 4:47 PMKen Krugler
09/28/2022, 4:48 PMAND dim3 NOT IN ('yy', 'zz')
then I get the old resultKen Krugler
09/28/2022, 4:48 PMRong R
09/28/2022, 4:49 PM(dim3 = 'yy' or dim3 = 'zz')
Rong R
09/28/2022, 4:49 PMKen Krugler
09/28/2022, 4:50 PMAND dim3 != 'yy' AND dim3 != 'zz'
but get the same (incorrect) resultKen Krugler
09/28/2022, 4:52 PMAND dim3 NOT IN ('yy', 'zz')
isn’t filtering out results that changed (or changed significantly) the sum(metric) for the top 3 results.Ken Krugler
09/28/2022, 4:54 PM"numGroupsLimitReached": false,
Rong R
09/28/2022, 4:54 PMdim3 != 'yy'
?Rong R
09/28/2022, 4:54 PMKen Krugler
09/28/2022, 4:55 PMKen Krugler
09/28/2022, 4:56 PMRong R
09/28/2022, 4:56 PMKen Krugler
09/28/2022, 4:57 PMKen Krugler
09/28/2022, 4:58 PMinvertedIndexColumns
set of dimension fieldsKen Krugler
09/28/2022, 5:17 PMnumDocsScanned=16484299,
scanInFilter=0,
scanPostFilter=32968598,
But when I remove the dim3 filter line, I get:
numDocsScanned=762028,
scanInFilter=0,
scanPostFilter=1524056,
I don’t understand why numDocsScanned
is so much higher when I add a filter - I would assume it went the other direction.Ken Krugler
09/28/2022, 5:39 PMRong R
09/28/2022, 5:58 PMJackie
09/28/2022, 7:43 PMJackie
09/28/2022, 7:45 PMJackie
09/28/2022, 7:46 PMnumGroupsLimitReached
is not properly set until release 0.11 (PR: https://github.com/apache/pinot/pull/8393)Ken Krugler
09/28/2022, 8:43 PMKen Krugler
09/28/2022, 8:50 PMNOT IN
filter? The dimension being filtered on in the NOT IN
clause (dim3
) is not part of the star tree, whereas dim1
, dim2
and key
are in the dimensionsSplitOrder
list. Don’t know if it matters, but dim1
is in the skipStarNodeCreationForDimensions
list.Jackie
09/28/2022, 9:10 PMdim3
is not included in the split order, we won't be able to use star-tree to solve the query if there is a filter on dim3
, which is the reason why it you see much higher numDocsScanned
Jackie
09/28/2022, 9:13 PMKen Krugler
09/28/2022, 9:14 PMOPTION(minServerGroupTrimSize=1000000)
, shouldn’t that do it?Jackie
09/28/2022, 9:15 PMKen Krugler
09/28/2022, 9:15 PMpinot.server.query.executor.num.groups.limit
, which doesn’t seem to have a query override?Jackie
09/28/2022, 9:15 PMKen Krugler
09/28/2022, 9:16 PMpinot.server.query.executor.num.groups.limit
introduced?Jackie
09/28/2022, 9:17 PMKen Krugler
09/28/2022, 9:20 PM./pinotServer-03-17-2022-2.log:2022/03/17 08:58:19.022 INFO [InstancePlanMakerImplV2] [Start a Pinot [SERVER]] Initializing plan maker with maxInitialResultHolderCapacity: 10000, numGroupsLimit: 100000, enableSegmentTrim: false, minSegmentGroupTrimSize: -1
So based on https://github.com/apache/pinot/issues/8089, would I also need to set the groupby.trim.threshold
property?Jackie
09/28/2022, 9:22 PMJackie
09/28/2022, 9:23 PMKen Krugler
09/28/2022, 9:24 PMJackie
09/28/2022, 9:32 PMKen Krugler
09/28/2022, 9:39 PMAND dim3 NOT IN ('yy', 'zz')
filter triggers incorrect results, since it doesn’t seem like adding this filter changes the summed value for the top 3 hits. Any thoughts on that?Jackie
09/28/2022, 9:49 PMKen Krugler
09/28/2022, 10:11 PMKen Krugler
09/29/2022, 5:40 PM