# troubleshooting
k
We recently updated our Pinot cluster from a patched version of 0.9 to 0.10. The following query now returns different (and incorrect) results:
SELECT sum(metric) AS sumMetric, key  
FROM table 
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173 
AND dim3 NOT IN ('yy', 'zz') 
GROUP BY key ORDER BY sumMetric DESC LIMIT 3
Previously the third result was:
1.7132548232917935E7  key3
but now it’s
1.5662814895781398E7   key4
However, doing an explicit query on key3 with:
SELECT sum(metric) AS sumMetric, key  
FROM table 
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173 
AND dim3 NOT IN ('yy', 'zz')
AND key = 'key3'
returns the previous sum for key3 of 1.7132548232917935E7, so it should still be the third result. This behavior is the same regardless of whether we add OPTION(segmentMinTrimSize=1000).
I’m wondering if anyone knows of a change in 0.10 that could cause this, and/or has suggestions for how to troubleshoot? Note that the segments being served by the old and new versions of Pinot should be the same. Also, we create (batch mode) segments partitioned such that all matching records will be in a set of 10 segments, and all records for a given key value will be in the same segment.
Given the above, I wasn’t expecting changing segmentMinTrimSize would fix the issue, but wanted to cover that possibility.
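(For reference, a sketch of how that query option was attached; the query and the option value are copied from the messages above, using the OPTION(...) suffix syntax that appears elsewhere in this thread:)
-- The original group-by query with the segment trim option appended (illustrative).
SELECT sum(metric) AS sumMetric, key
FROM table
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173
  AND dim3 NOT IN ('yy', 'zz')
GROUP BY key ORDER BY sumMetric DESC LIMIT 3
OPTION(segmentMinTrimSize=1000)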
I’m wondering if there’s some PR I merged into the 0.9 branch that didn’t make it into 0.10
r
interesting, so basically it means the agg-only query returns the correct result but the agg group-by didn’t
just to clarify, when you said but now it’s 1.5662814895781398E7 key4, you really meant key3, yes?
k
Right. If I add an additional filter on a dimension (dim4) where every value of key3 will have the same value for dim4, I get the correct results
No, it’s a new key that has a lower sum than key3
So instead of getting key3 (with patched Pinot 0.9), I get key4
r
at the 3rd ordered value
k
Even though sum(metric) for key3 is > sum(metric) for key4
Right
r
oh. so it is a problem with order by then
k
And even if I set LIMIT 10000, that result set doesn’t contain key3
r
oh... so key3 is MISSING entirely
k
Right, at least for top 10000 groups
I see that there are a bit more than 10K records that match key3 and the other filters in the query.
r
could you try
sum(CAST(metric AS DOUBLE))
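(For reference, a sketch of that suggestion applied to the group-by query from earlier in the thread; the column and filter values are taken from the messages above:)
-- Same query, but casting the metric column to DOUBLE before aggregating.
SELECT sum(CAST(metric AS DOUBLE)) AS sumMetric, key
FROM table
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173
  AND dim3 NOT IN ('yy', 'zz')
GROUP BY key ORDER BY sumMetric DESC LIMIT 3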
k
Same result
r
haha worth a shot
k
OK, something weird. If I remove the AND dim3 NOT IN ('yy', 'zz') clause, then I get the old result (the correct result).
r
interesting. how about replacing it with (dim3 = 'yy' OR dim3 = 'zz')?
^ I meant not-equal with an AND clause.
k
Right - I changed it to AND dim3 != 'yy' AND dim3 != 'zz' but get the same (incorrect) result.
Also, the AND dim3 NOT IN ('yy', 'zz') filter isn’t filtering out records that change (or significantly change) the sum(metric) for the top 3 results.
And in the trace output: "numGroupsLimitReached": false
r
if you only put dim3 != 'yy'?
if it returns something with key3... I feel like the range predicate rewrite rule might be broken
k
I don’t get key3 as the #3 result (so same new/incorrect results)
I can do another deep query (LIMIT 10000) and see if key3 is in that set
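(A sketch of that check, keeping only the dim3 != 'yy' form of the filter and a deep limit so key3 should appear if it is grouped at all:)
-- Deep group-by to see whether key3 shows up anywhere in the top 10000 groups.
SELECT sum(metric) AS sumMetric, key
FROM table
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173
  AND dim3 != 'yy'
GROUP BY key ORDER BY sumMetric DESC LIMIT 10000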
r
ok. does dim3 have an index?
k
Checking…
Yes - it’s in the invertedIndexColumns set of dimension fields.
A few more odd details - if I look at a server log when using the dim3 filter, I see:
numDocsScanned=16484299,
scanInFilter=0,
scanPostFilter=32968598,
But when I remove the dim3 filter line, I get:
numDocsScanned=762028,
scanInFilter=0,
scanPostFilter=1524056,
I don’t understand why numDocsScanned is so much higher when I add a filter - I would have assumed it would go the other direction.
I feel like there were changes (e.g. https://github.com/apache/pinot/pull/6991) that I thought I’d patched into our older Pinot build, so I wouldn’t expect that to have caused a change.
r
@Jackie
j
Do you have a star-tree index for this table?
What is the cardinality of the key column?
One thing worth noting is that numGroupsLimitReached is not properly set until release 0.11 (PR: https://github.com/apache/pinot/pull/8393).
k
@Jackie - yes, we have a star-tree index. The cardinality for the key column is very high (I would guess 2M per segment).
@Jackie - but what would change with having versus not having the NOT IN filter? The dimension being filtered on in the NOT IN clause (dim3) is not part of the star-tree, whereas dim1, dim2 and key are in the dimensionsSplitOrder list. Don’t know if it matters, but dim1 is in the skipStarNodeCreationForDimensions list.
j
That is fine. Because dim3 is not included in the split order, we won’t be able to use the star-tree to solve the query if there is a filter on dim3, which is why you see a much higher numDocsScanned.
I suspect the problem is from reaching the groups limit. Can you try increasing the groups limit on the server and see if the result changes?
k
I’m using OPTION(minServerGroupTrimSize=1000000), shouldn’t that do it?
j
No, that won't change the groups limit. You'll have to configure it on the server under the key `pinot.server.query.executor.num.groups.limit`: https://docs.pinot.apache.org/configuration-reference/server
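(A minimal sketch of that server-side setting, assuming a standard Pinot server config file; the property key is the one named above, the value is only illustrative, and the server typically needs a restart to pick it up:)
# Illustrative server config change: raise the per-server group-by groups limit
# (the value reported in the server logs further down in this thread is 100000).
pinot.server.query.executor.num.groups.limit=10000000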
k
Also, on https://docs.pinot.apache.org/users/user-guide-query/grouping-algorithm I see pinot.server.query.executor.num.groups.limit, which doesn’t seem to have a query override?
j
Yeah, we don’t have a query override for that as of now
k
OK, that’s promising. When was pinot.server.query.executor.num.groups.limit introduced?
j
That was introduced before 0.9, but I’m not sure if it is handled the same way in 0.9 vs 0.10, so we want to first verify whether that is causing the different result
k
I checked the server logs, and see:
./pinotServer-03-17-2022-2.log:2022/03/17 08:58:19.022 INFO [InstancePlanMakerImplV2] [Start a Pinot [SERVER]] Initializing plan maker with maxInitialResultHolderCapacity: 10000, numGroupsLimit: 100000, enableSegmentTrim: false, minSegmentGroupTrimSize: -1
So based on https://github.com/apache/pinot/issues/8089, would I also need to set the groupby.trim.threshold property?
j
No, the trim threshold can be overridden by query if needed
To get an absolutely accurate result, you will need to remove the groups limit as well as the segment/server trim. But I feel removing the groups limit should be enough to get you the correct result as long as the data distribution is not too skewed
k
I seem to remember discussing with you whether the server process can/should use a priority queue when calculating groups and there’s an ORDER BY…is that still pending?
j
I think you are referring to the result trimming within the segment. That is not added yet. We need to evaluate the cost of it
k
I’m still surprised that adding in the AND dim3 NOT IN ('yy', 'zz') filter triggers incorrect results, since it doesn’t seem like adding this filter changes the summed value for the top 3 hits. Any thoughts on that?
j
It can change the results if the groups limit is reached. When the star-tree is used, we use the pre-aggregated records, which have a different order than the raw records. Depending on which keys are returned first, we will ignore the remaining keys after the limit is reached
k
Thanks, that’s very interesting! And that would explain why so many more records were scanned with the filter, since the filter’s dimension isn’t part of the star-tree definition, so it’s a full scan of the segment.
@Jackie - thanks so much, that was it!!!
👍 2