# general
  • Kishore G (02/18/2020, 8:32 PM)
    We can do early termination if there is no order by
  • Kishore G (02/18/2020, 8:33 PM)
    But with order by, there is nothing much we can do to terminate early...
  • Kishore G (02/18/2020, 8:34 PM)
    What is the problem you are trying to solve?
  • Ting Chen (02/18/2020, 8:59 PM)
    The main issue we have is that query latency is too long (~15 s). For early termination, since the table is physically sorted by the ORDER_BY column, I suppose an ideal plan is to check the relevant segments (starting with the segments with the largest value in the filtering range) and stop when enough results have been collected?
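A minimal sketch of the plan Ting describes, assuming each segment exposes the maximum value of the sorted ORDER BY column and that segment value ranges do not overlap; the Segment shape and field names below are hypothetical, not Pinot internals.

```python
from collections import namedtuple

# max_value: largest ORDER BY value held by the segment
# rows: the segment's matching rows, already sorted descending
Segment = namedtuple("Segment", ["max_value", "rows"])

def query_with_early_termination(segments, limit):
    # Visit segments holding the largest ORDER BY values first; correctness
    # relies on the segments' value ranges not overlapping.
    results = []
    for seg in sorted(segments, key=lambda s: s.max_value, reverse=True):
        results.extend(seg.rows)
        if len(results) >= limit:
            break  # enough rows collected; the remaining segments are skipped
    return results[:limit]

# Example: three non-overlapping segments, LIMIT 3.
segs = [Segment(100, [100, 99]), Segment(98, [98, 97]), Segment(50, [50])]
print(query_with_early_termination(segs, limit=3))  # -> [100, 99, 98]
```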
  • Kishore G (02/18/2020, 9:02 PM)
    That’s possible. What is the time range in the query?
  • Ting Chen (02/18/2020, 9:04 PM)
    From 7 days ago to a few seconds ago. Basically the past 7 days' data.
  • Kishore G (02/18/2020, 9:10 PM)
    It’s a good optimization to have. Worth starting a thread and discussing further. For now, is it possible for the client to break it up into multiple queries, one for each day?
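A sketch of the client-side workaround suggested here: issue one query per day, newest day first, and stop once enough rows have been collected. The table name, timestamp column, and broker address are placeholders, and the /query/sql endpoint and resultTable response field are assumed from the standard Pinot broker API.

```python
import time
import requests

BROKER = "http://localhost:8099/query/sql"  # assumed broker SQL endpoint
LIMIT = 500
DAY_MS = 24 * 3600 * 1000

def fetch_last_7_days(table="myTable", ts_col="timestampMs"):
    now_ms = int(time.time() * 1000)
    rows = []
    # Walk backwards one day at a time, newest day first, so the freshest
    # rows are collected before older days are even queried.
    for day in range(7):
        end = now_ms - day * DAY_MS
        start = end - DAY_MS
        sql = (f"SELECT * FROM {table} "
               f"WHERE {ts_col} >= {start} AND {ts_col} < {end} "
               f"ORDER BY {ts_col} DESC LIMIT {LIMIT}")
        resp = requests.post(BROKER, json={"sql": sql}).json()
        rows.extend(resp.get("resultTable", {}).get("rows", []))
        if len(rows) >= LIMIT:
            break  # the most recent days already satisfied the limit
    return rows[:LIMIT]
```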
  • Ting Chen (02/18/2020, 9:12 PM)
    I will file an issue for this and do some investigation of the code. Yes, your idea is basically the workaround for now. I asked the customers to look for the past 1 day's data instead: they still got the results they needed, and the latency was halved.
  • Kishore G (02/18/2020, 9:14 PM)
    Cool. What you want is doable with some optimization in the planning phase.
  • Xiang Fu (02/18/2020, 9:46 PM)
    @User can you also send this to the dev@pinot.apache.org mailing list?
  • Xiang Fu (02/18/2020, 9:46 PM)
    one thing about this is that the query will hit many segments and merge the results
  • Xiang Fu (02/18/2020, 9:47 PM)
    so it’s hard to determine the global ordering needed for early termination
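A rough illustration of the scatter-gather Xiang describes: each segment returns its own sorted top-N, and the merge step has to see all of them to know the global ordering, so no segment can be skipped without additional per-segment bounds. The names and data below are illustrative only.

```python
import heapq
from itertools import islice

def merge_segment_results(per_segment_rows, limit, order_by_index=0):
    # Each element of per_segment_rows is one segment's rows, already sorted
    # descending on the ORDER BY column.
    merged = heapq.merge(*per_segment_rows,
                         key=lambda row: row[order_by_index],
                         reverse=True)
    return list(islice(merged, limit))

# Example: three segments, global top-3 by the first column.
print(merge_segment_results(
    [[(9, "a"), (4, "b")], [(8, "c"), (7, "d")], [(5, "e")]], limit=3))
# -> [(9, 'a'), (8, 'c'), (7, 'd')]
```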
  • Ting Chen (02/18/2020, 10:22 PM)
    Conversation sent to the mailing list. I was wondering if we can reduce the number of segments hit, given that they are already sorted by the ORDER BY column.
  • Ting Chen (02/18/2020, 10:25 PM)
    Or somehow we can terminate early once the segments with the largest ORDER BY values have returned enough rows.
  • Jackie (02/20/2020, 12:34 AM)
    @User Can you please check the numEntriesScannedInFilter metadata in the query response? If that is not 0, you might want to try out the latest version, where we optimized the RANGE case on sorted columns.
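One way to inspect that metadata is to run the query through the broker and print the scan counters from the response. The broker address and the query itself are placeholders; the /query/sql endpoint and response field names are assumed from the standard Pinot broker API.

```python
import requests

resp = requests.post(
    "http://localhost:8099/query/sql",  # assumed broker address
    json={"sql": "SELECT * FROM myTable "
                 "WHERE timestampMs > 1582000000000 "
                 "ORDER BY timestampMs DESC LIMIT 500"},
).json()

# numEntriesScannedInFilter == 0 means the filter was resolved without
# scanning values (e.g. via the sorted-column range); a large value means
# the filter is doing a scan.
for key in ("numDocsScanned",
            "numEntriesScannedInFilter",
            "numEntriesScannedPostFilter",
            "timeUsedMs"):
    print(key, "=", resp.get(key))
```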
  • Jackie (02/20/2020, 12:35 AM)
    Ideally it should only collect 500 documents from each segment, which should not be this slow, so I am wondering whether the time is spent on scanning the records
  • Ting Chen (02/20/2020, 12:38 AM)
    Roughly when was this optimization done? Which PR is that? Our prod server runs the version from April 19.
  • Jackie (02/20/2020, 12:39 AM)
    It is fairly new: https://github.com/apache/incubator-pinot/pull/5013
  • Ting Chen (02/20/2020, 1:00 AM)
    We are not on this version yet. Also, the PR says the optimization applies to offline segments only. Do offline segments here mean segments in offline tables only?
  • Jackie (02/20/2020, 1:03 AM)
    @User Offline means on disk segments (vs in memory for CONSUMING segments), so committed realtime segments can also benefit from this.
  • Jackie (02/20/2020, 1:05 AM)
    Is this a realtime table? What is the granularity (days, seconds, milliseconds etc.) of your timestamp?
  • Ting Chen (02/20/2020, 1:07 AM)
    Got it. It is a realtime table, and the timestamp is in milliseconds.
  • Ting Chen (02/20/2020, 1:08 AM)
    Does the granularity matter here? We sort the table based on the timestamp column.
  • Jackie (02/20/2020, 1:09 AM)
    Then I suspect the long latency is caused by the consuming segment
  • Jackie (02/20/2020, 1:09 AM)
    When consuming from the stream, the column is not sorted yet. It becomes sorted when the segment is committed
  • Jackie (02/20/2020, 1:11 AM)
    For the consuming segment, even the dictionary is not sorted, so we have to scan the whole dictionary to get the matching values for the range, and then we need to scan the column
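A conceptual illustration, not Pinot's actual code, of why the range filter is cheaper once the dictionary is sorted: two binary searches bound the matching dictionary ids, whereas an unsorted (consuming-segment) dictionary forces a check of every entry.

```python
from bisect import bisect_left, bisect_right

def matching_dict_ids_sorted(dictionary, lo, hi):
    # Committed segment: dictionary is sorted, so the ids matching [lo, hi]
    # form a contiguous range found with two binary searches.
    return range(bisect_left(dictionary, lo), bisect_right(dictionary, hi))

def matching_dict_ids_unsorted(dictionary, lo, hi):
    # Consuming segment: dictionary is in arrival order, so every entry must
    # be checked and the matching ids are scattered.
    return [i for i, v in enumerate(dictionary) if lo <= v <= hi]

print(list(matching_dict_ids_sorted([10, 20, 30, 40, 50], 20, 40)))  # [1, 2, 3]
print(matching_dict_ids_unsorted([30, 10, 50, 20, 40], 20, 40))      # [0, 3, 4]
```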
  • Ting Chen (02/20/2020, 1:11 AM)
    Are there any metrics we can track down to that level (i.e., segment level)?
  • Jackie (02/20/2020, 1:13 AM)
    You can try to not query the consuming segment and see whether the latency is lower
  • Ting Chen (02/20/2020, 1:15 AM)
    Will try. It also looks like making segments smaller can alleviate the issue, but in our case it is a very big table.
  • Jackie (02/20/2020, 1:16 AM)
    If the bottleneck is on the consuming segment, then making segments smaller can definitely help
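If smaller consuming segments are worth trying, the relevant knobs live in the streamConfigs section of the realtime table config. The key names below are taken from Pinot documentation of roughly this era and may differ between versions; the values are placeholders only.

```python
# Fragment of a realtime table config's streamConfigs (shown as a Python dict
# for illustration). Key names are assumptions; values are placeholders.
stream_configs_fragment = {
    # Flush (commit) the consuming segment after this many rows ...
    "realtime.segment.flush.threshold.size": "500000",
    # ... or after this much time, whichever comes first.
    "realtime.segment.flush.threshold.time": "6h",
}
```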