# general
  • Kishore G (02/18/2020, 8:32 PM)
    We can do early termination if there is no order by
  • Kishore G (02/18/2020, 8:33 PM)
    But with order by, there is nothing much we can do to terminate early...
  • Kishore G (02/18/2020, 8:34 PM)
    What is the problem you are trying to solve?
  • Ting Chen (02/18/2020, 8:59 PM)
    The main issue we have is that query latency is too long (~15 s). For early termination, since the table is physically sorted by the ORDER_BY column, I suppose an ideal plan is to check the relevant segments (starting with the segments with the largest value in the filtering range) and stop when enough results have been collected?
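A minimal sketch of the plan Ting describes, assuming each segment exposes the maximum value of the sorted ORDER BY column and that segment value ranges do not overlap; the Segment shape and field names below are hypothetical, not Pinot internals.

```python
from collections import namedtuple

# max_value: largest ORDER BY value held by the segment
# rows: the segment's matching rows, already sorted descending
Segment = namedtuple("Segment", ["max_value", "rows"])

def query_with_early_termination(segments, limit):
    # Visit segments holding the largest ORDER BY values first; correctness
    # relies on the segments' value ranges not overlapping.
    results = []
    for seg in sorted(segments, key=lambda s: s.max_value, reverse=True):
        results.extend(seg.rows)
        if len(results) >= limit:
            break  # enough rows collected; the remaining segments are skipped
    return results[:limit]

# Example: three non-overlapping segments, LIMIT 3.
segs = [Segment(100, [100, 99]), Segment(98, [98, 97]), Segment(50, [50])]
print(query_with_early_termination(segs, limit=3))  # -> [100, 99, 98]
```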
  • Kishore G (02/18/2020, 9:02 PM)
    That’s possible. What is the time range in the query?
  • Ting Chen (02/18/2020, 9:04 PM)
    From 7 days ago to a few seconds ago. Basically the past 7 days' data.
  • Kishore G (02/18/2020, 9:10 PM)
    It’s a good optimization to have. Worth starting a thread and discussing further. For now, is it possible for the client to break it up into multiple queries, one for each day?
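A sketch of the client-side workaround suggested here: issue one query per day, newest day first, and stop once enough rows have been collected. The table name, timestamp column, and broker address are placeholders, and the /query/sql endpoint and resultTable response field are assumed from the standard Pinot broker API.

```python
import time
import requests

BROKER = "http://localhost:8099/query/sql"  # assumed broker SQL endpoint
LIMIT = 500
DAY_MS = 24 * 3600 * 1000

def fetch_last_7_days(table="myTable", ts_col="timestampMs"):
    now_ms = int(time.time() * 1000)
    rows = []
    # Walk backwards one day at a time, newest day first, so the freshest
    # rows are collected before older days are even queried.
    for day in range(7):
        end = now_ms - day * DAY_MS
        start = end - DAY_MS
        sql = (f"SELECT * FROM {table} "
               f"WHERE {ts_col} >= {start} AND {ts_col} < {end} "
               f"ORDER BY {ts_col} DESC LIMIT {LIMIT}")
        resp = requests.post(BROKER, json={"sql": sql}).json()
        rows.extend(resp.get("resultTable", {}).get("rows", []))
        if len(rows) >= LIMIT:
            break  # the most recent days already satisfied the limit
    return rows[:LIMIT]
```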
  • Ting Chen (02/18/2020, 9:12 PM)
    I will file an issue for this and do some investigation of the code. Yes, your idea is basically the workaround for now. I asked the customers to look for the past 1 day's data instead: they still got the results they needed, and the latency was halved.
  • Kishore G (02/18/2020, 9:14 PM)
    Cool. What you want is doable with some optimization in the planning phase.
  • Xiang Fu (02/18/2020, 9:46 PM)
    @User can you also send this to the dev@pinot.apache.org mailing list?
  • Xiang Fu (02/18/2020, 9:46 PM)
    one thing about this is that the query will hit many segments and merge the results
  • Xiang Fu (02/18/2020, 9:47 PM)
    so it’s hard to determine the global ordering needed for early termination
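A rough illustration of the scatter-gather Xiang describes: each segment returns its own sorted top-N, and the merge step has to see all of them to know the global ordering, so no segment can be skipped without additional per-segment bounds. The names and data below are illustrative only.

```python
import heapq
from itertools import islice

def merge_segment_results(per_segment_rows, limit, order_by_index=0):
    # Each element of per_segment_rows is one segment's rows, already sorted
    # descending on the ORDER BY column.
    merged = heapq.merge(*per_segment_rows,
                         key=lambda row: row[order_by_index],
                         reverse=True)
    return list(islice(merged, limit))

# Example: three segments, global top-3 by the first column.
print(merge_segment_results(
    [[(9, "a"), (4, "b")], [(8, "c"), (7, "d")], [(5, "e")]], limit=3))
# -> [(9, 'a'), (8, 'c'), (7, 'd')]
```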
  • Ting Chen (02/18/2020, 10:22 PM)
    Conversation sent to the mailing list. I was wondering if we can reduce the number of segments hit, given that they are already sorted by the ORDER BY column.
  • Ting Chen (02/18/2020, 10:25 PM)
    Or somehow we can terminate early once the segments with the largest ORDER BY values have returned enough rows.
  • Jackie (02/20/2020, 12:34 AM)
    @User Can you please check the numEntriesScannedInFilter metadata in the query response? If that is not 0, you might want to try out the latest version, where we optimized the RANGE case on sorted columns.
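One way to inspect that metadata is to run the query through the broker and print the scan counters from the response. The broker address and the query itself are placeholders; the /query/sql endpoint and response field names are assumed from the standard Pinot broker API.

```python
import requests

resp = requests.post(
    "http://localhost:8099/query/sql",  # assumed broker address
    json={"sql": "SELECT * FROM myTable "
                 "WHERE timestampMs > 1582000000000 "
                 "ORDER BY timestampMs DESC LIMIT 500"},
).json()

# numEntriesScannedInFilter == 0 means the filter was resolved without
# scanning values (e.g. via the sorted-column range); a large value means
# the filter is doing a scan.
for key in ("numDocsScanned",
            "numEntriesScannedInFilter",
            "numEntriesScannedPostFilter",
            "timeUsedMs"):
    print(key, "=", resp.get(key))
```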
  • Jackie (02/20/2020, 12:35 AM)
    Ideally it should only collect 500 documents from each segment, which should not be this slow, so I am wondering whether the time is spent on scanning the records
  • Ting Chen (02/20/2020, 12:38 AM)
    Roughly when was this optimization done? Which PR is that? Our prod server runs the version from April 19.
  • Jackie (02/20/2020, 12:39 AM)
    It is fairly new: https://github.com/apache/incubator-pinot/pull/5013
  • Ting Chen (02/20/2020, 1:00 AM)
    We are not on this version yet. Also, the PR says the optimization applies to offline segments only. Do offline segments here mean segments in offline tables only?
  • Jackie (02/20/2020, 1:03 AM)
    @User Offline means on disk segments (vs in memory for CONSUMING segments), so committed realtime segments can also benefit from this.
  • Jackie (02/20/2020, 1:05 AM)
    Is this a realtime table? What is the granularity (days, seconds, milliseconds etc.) of your timestamp?
  • Ting Chen (02/20/2020, 1:07 AM)
    Got it. It is a realtime table, and the timestamp is in milliseconds.
  • Ting Chen (02/20/2020, 1:08 AM)
    Does the granularity matter here? We sort the table based on the timestamp column.
  • Jackie (02/20/2020, 1:09 AM)
    Then I suspect the long latency is caused by the consuming segment
  • Jackie (02/20/2020, 1:09 AM)
    When consuming from the stream, the column is not sorted yet. It becomes sorted when the segment is committed
  • Jackie (02/20/2020, 1:11 AM)
    For the consuming segment, even the dictionary is not sorted, so we have to scan the whole dictionary to get the matching values for the range, and then we need to scan the column
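A conceptual illustration, not Pinot's actual code, of why the range filter is cheaper once the dictionary is sorted: two binary searches bound the matching dictionary ids, whereas an unsorted (consuming-segment) dictionary forces a check of every entry.

```python
from bisect import bisect_left, bisect_right

def matching_dict_ids_sorted(dictionary, lo, hi):
    # Committed segment: dictionary is sorted, so the ids matching [lo, hi]
    # form a contiguous range found with two binary searches.
    return range(bisect_left(dictionary, lo), bisect_right(dictionary, hi))

def matching_dict_ids_unsorted(dictionary, lo, hi):
    # Consuming segment: dictionary is in arrival order, so every entry must
    # be checked and the matching ids are scattered.
    return [i for i, v in enumerate(dictionary) if lo <= v <= hi]

print(list(matching_dict_ids_sorted([10, 20, 30, 40, 50], 20, 40)))  # [1, 2, 3]
print(matching_dict_ids_unsorted([30, 10, 50, 20, 40], 20, 40))      # [0, 3, 4]
```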
  • Ting Chen (02/20/2020, 1:11 AM)
    Are there any metrics we can track down to that level (i.e., segment level)?
  • Jackie (02/20/2020, 1:13 AM)
    You can try to not query the consuming segment and see whether the latency is lower
  • Ting Chen (02/20/2020, 1:15 AM)
    Will try. It also looks like making segments smaller can alleviate the issue, but in our case it is a very big table.
  • Jackie (02/20/2020, 1:16 AM)
    If the bottleneck is on the consuming segment, then making segments smaller can definitely help
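If smaller consuming segments are worth trying, the relevant knobs live in the streamConfigs section of the realtime table config. The key names below are taken from Pinot documentation of roughly this era and may differ between versions; the values are placeholders only.

```python
# Fragment of a realtime table config's streamConfigs (shown as a Python dict
# for illustration). Key names are assumptions; values are placeholders.
stream_configs_fragment = {
    # Flush (commit) the consuming segment after this many rows ...
    "realtime.segment.flush.threshold.size": "500000",
    # ... or after this much time, whichever comes first.
    "realtime.segment.flush.threshold.time": "6h",
}
```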