# troubleshooting
k
We recently updated our Pinot cluster from a patched version of 0.9 to 0.10. The following query now returns different (and incorrect) results:
SELECT sum(metric) AS sumMetric, key  
FROM table 
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173 
AND dim3 NOT IN ('yy', 'zz') 
GROUP BY key ORDER BY sumMetric DESC LIMIT 3
Previously the third result was:
1.7132548232917935E7  key3
but now it’s
1.5662814895781398E7   key4
However, doing an explicit query on key3 with:
SELECT sum(metric) AS sumMetric, key  
FROM table 
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173 
AND dim3 NOT IN ('yy', 'zz')
AND key = 'key3'
returns the previous sum for key3 of 1.7132548232917935E7, so it should still be the third result. This behavior is the same regardless of whether we add OPTION(segmentMinTrimSize=1000).
I’m wondering if anyone knows of a change in 0.10 that could cause this, and/or has suggestions for how to troubleshoot? Note that the segments being served by the old and new versions of Pinot should be the same. Also, we create (batch mode) segments partitioned such that all matching records will be in a set of 10 segments, and all records for a given key value will be in the same segment.
Given the above, I wasn’t expecting changing segmentMinTrimSize would fix the issue, but wanted to cover that possibility.
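(For reference, a sketch of how that query option was attached; the query and the option value are copied from the messages above, using the OPTION(...) suffix syntax that appears elsewhere in this thread:)
-- The original group-by query with the segment trim option appended (illustrative).
SELECT sum(metric) AS sumMetric, key
FROM table
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173
  AND dim3 NOT IN ('yy', 'zz')
GROUP BY key ORDER BY sumMetric DESC LIMIT 3
OPTION(segmentMinTrimSize=1000)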
I’m wondering if there’s some PR I merged into the 0.9 branch that didn’t make it into 0.10
r
interesting, so basically it means the agg-only query returns the correct result but the agg group-by didn’t
just to clarify, when you said but now it’s 1.5662814895781398E7 key4, you really meant key3, yes?
k
Right. If I add an additional filter on a dimension (dim4) where every value of key3 will have the same value for dim4, I get the correct results
No, it’s a new key that has a lower sum than key3
So instead of getting key3 (with patched Pinot 0.9), I get key4
r
at the 3rd ordered value
k
Even though sum(metric) for key3 is > sum(metric) for key4
Right
r
oh. so it is a problem with order by then
k
And even if I set LIMIT 10000, that result set doesn’t contain key3
r
oh... so key3 is MISSING entirely
k
Right, at least for top 10000 groups
I see that there are a bit more than 10K records that match key3 and the other filters in the query.
r
could you try
sum(CAST(metric AS DOUBLE))
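(For reference, a sketch of that suggestion applied to the group-by query from earlier in the thread; the column and filter values are taken from the messages above:)
-- Same query, but casting the metric column to DOUBLE before aggregating.
SELECT sum(CAST(metric AS DOUBLE)) AS sumMetric, key
FROM table
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173
  AND dim3 NOT IN ('yy', 'zz')
GROUP BY key ORDER BY sumMetric DESC LIMIT 3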
k
Same result
r
haha worth a shot
k
OK, something weird. If I remove the AND dim3 NOT IN ('yy', 'zz') clause, then I get the old result (the correct result).
r
interesting. how about replacing it with (dim3 = 'yy' OR dim3 = 'zz')?
^ I meant not-equal with an AND clause.
k
Right - I changed it to AND dim3 != 'yy' AND dim3 != 'zz' but get the same (incorrect) result.
Also, the AND dim3 NOT IN ('yy', 'zz') filter isn’t filtering out records that change (or significantly change) the sum(metric) for the top 3 results.
And in the trace output: "numGroupsLimitReached": false
r
if you only put dim3 != 'yy'?
if it returns something with key3... I feel like the range predicate rewrite rule might be broken
k
I don’t get key3 as the #3 result (so same new/incorrect results)
I can do another deep query (LIMIT 10000) and see if key3 is in that set
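(A sketch of that check, keeping only the dim3 != 'yy' form of the filter and a deep limit so key3 should appear if it is grouped at all:)
-- Deep group-by to see whether key3 shows up anywhere in the top 10000 groups.
SELECT sum(metric) AS sumMetric, key
FROM table
WHERE dim1 = 'xx' AND dim2 >= 19144 AND dim2 <= 19173
  AND dim3 != 'yy'
GROUP BY key ORDER BY sumMetric DESC LIMIT 10000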
r
ok. does dim3 have an index?
k
Checking…
Yes - it’s in the invertedIndexColumns set of dimension fields.
A few more odd details - if I look at a server log when using the dim3 filter, I see:
numDocsScanned=16484299,
scanInFilter=0,
scanPostFilter=32968598,
But when I remove the dim3 filter line, I get:
numDocsScanned=762028,
scanInFilter=0,
scanPostFilter=1524056,
I don’t understand why numDocsScanned is so much higher when I add a filter - I would have assumed it would go the other direction.
I feel like there were changes (e.g. https://github.com/apache/pinot/pull/6991) that I thought I’d patched into our older Pinot build, so I wouldn’t expect that to have caused a change.
r
@Jackie
j
Do you have a star-tree index for this table?
What is the cardinality of the key column?
One thing worth noting is that numGroupsLimitReached is not properly set until release 0.11 (PR: https://github.com/apache/pinot/pull/8393).
k
@Jackie - yes, we have a star-tree index. The cardinality for the key column is very high (I would guess 2M per segment).
@Jackie - but what would change with having versus not having the NOT IN filter? The dimension being filtered on in the NOT IN clause (dim3) is not part of the star-tree, whereas dim1, dim2 and key are in the dimensionsSplitOrder list. Don’t know if it matters, but dim1 is in the skipStarNodeCreationForDimensions list.
j
That is fine. Because dim3 is not included in the split order, we won’t be able to use the star-tree to solve the query if there is a filter on dim3, which is why you see a much higher numDocsScanned.
I suspect the problem is from reaching the groups limit. Can you try increasing the groups limit on the server and see if the result changes?
k
I’m using OPTION(minServerGroupTrimSize=1000000), shouldn’t that do it?
j
No, that won't change the groups limit. You'll have to configure it on the server under the key `pinot.server.query.executor.num.groups.limit`: https://docs.pinot.apache.org/configuration-reference/server
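(A minimal sketch of that server-side setting, assuming a standard Pinot server config file; the property key is the one named above, the value is only illustrative, and the server typically needs a restart to pick it up:)
# Illustrative server config change: raise the per-server group-by groups limit
# (the value reported in the server logs further down in this thread is 100000).
pinot.server.query.executor.num.groups.limit=10000000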
k
Also, on https://docs.pinot.apache.org/users/user-guide-query/grouping-algorithm I see pinot.server.query.executor.num.groups.limit, which doesn’t seem to have a query override?
j
Yeah, we don’t have a query override for that as of now
k
OK, that’s promising. When was pinot.server.query.executor.num.groups.limit introduced?
j
That was introduced before 0.9, but I’m not sure if it is handled the same way in 0.9 vs 0.10, so we want to first verify whether that is causing the different result
k
I checked the server logs, and see:
./pinotServer-03-17-2022-2.log:2022/03/17 08:58:19.022 INFO [InstancePlanMakerImplV2] [Start a Pinot [SERVER]] Initializing plan maker with maxInitialResultHolderCapacity: 10000, numGroupsLimit: 100000, enableSegmentTrim: false, minSegmentGroupTrimSize: -1
So based on https://github.com/apache/pinot/issues/8089, would I also need to set the groupby.trim.threshold property?
j
No, the trim threshold can be overridden by query if needed
To get an absolutely accurate result, you will need to remove the groups limit as well as the segment/server trim. But I feel removing the groups limit should be enough to get you the correct result as long as the data distribution is not too skewed
k
I seem to remember discussing with you whether the server process can/should use a priority queue when calculating groups and there’s an ORDER BY…is that still pending?
j
I think you are referring to the result trimming within the segment. That is not added yet. We need to evaluate the cost of it
k
I’m still surprised that adding in the AND dim3 NOT IN ('yy', 'zz') filter triggers incorrect results, since it doesn’t seem like adding this filter changes the summed value for the top 3 hits. Any thoughts on that?
j
It can change the results if the groups limit is reached. When the star-tree is used, we use the pre-aggregated records, which have a different order than the raw records. Depending on which keys are returned first, we will ignore the remaining keys after the limit is reached
k
Thanks, that’s very interesting! And that would explain why so many more records were scanned with the filter, since the filter’s dimension isn’t part of the star-tree definition, so it’s a full scan of the segment.
@Jackie - thanks so much, that was it!!!
👍 2