For most of Time Series /Audit data, Time Criteria...
# general
v
For most time-series/audit data, the time criterion is the basic filter. E.g., for one year of data with segments created daily, there will be 365 segments per year. Even queries that access only the last month or last week of data will be scheduled to scan all segments, including unnecessary ones. Is it possible to maintain min/max values of the primary time column in the table metadata? Maintaining time-column metadata would help broker-side segment pruning, similar to partitioning.
m
Pinot already does that and prunes segments based on the min/max timestamp in the segment metadata.
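To illustrate the idea, here is a minimal sketch of min/max time pruning. It is not Pinot's actual code: `SegmentMeta`, `prune_by_time`, and the day-number time unit are hypothetical names and assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMeta:
    """Hypothetical per-segment metadata: name plus inclusive time bounds."""
    name: str
    min_time: int  # smallest value of the time column in the segment
    max_time: int  # largest value of the time column in the segment


def prune_by_time(segments, query_start, query_end):
    """Keep only segments whose [min_time, max_time] overlaps the query range."""
    return [
        s for s in segments
        if s.max_time >= query_start and s.min_time <= query_end
    ]


# One segment per day for a year; day numbers 0..364 are an assumption.
segments = [SegmentMeta(f"seg_{day}", day, day) for day in range(365)]

# A "last week" query covers days 358..364 inclusive.
last_week = prune_by_time(segments, 358, 364)
print(len(last_week))  # 7
```

Only the metadata is inspected here, so the 358 non-matching segments are dropped without reading any of their rows.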
v
So a query that accesses last week's data (7 segments) will be scheduled to scan only 7 segments? Does segment pruning happen at the broker level itself or at the server level?
m
We have some pruning that happens at the broker level and other pruning at the server level.
Yes, only 7 days of segments will be processed. Also, Pinot has sorted and inverted indexes that can be used to further avoid scanning all data inside these 7 segments.
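As a rough illustration of why a sorted column avoids a full scan inside a segment, here is a sketch using binary search. It assumes the column values are stored in sorted order by doc id; `sorted_range` is a hypothetical helper, not a Pinot API.

```python
from bisect import bisect_left, bisect_right


def sorted_range(sorted_values, lo, hi):
    """Return the (start, end) doc-id range, end exclusive, where lo <= value <= hi."""
    start = bisect_left(sorted_values, lo)   # first doc with value >= lo
    end = bisect_right(sorted_values, hi)    # one past the last doc with value <= hi
    return start, end


# Column values of one segment, already sorted (assumption for the example).
values = [10, 10, 20, 20, 20, 30, 40, 40]

start, end = sorted_range(values, 20, 30)
print(start, end)          # 2 6
print(values[start:end])   # [20, 20, 20, 30]
```

Two binary searches locate the matching doc range, so the cost grows with the logarithm of the segment size rather than with the number of rows.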
v
1. Based on my understanding from the documentation, partitioning helps segment pruning at the broker level itself.
2. For a query on last week's data, all 365 segments will be scheduled by the broker, but only 7 segments will be processed on the server; the remaining segments will be pruned on the server based on segment metadata.
3. My suggestion is to handle the main time-column criterion the same way as the partition-column criterion, i.e., prune at the broker level to avoid unnecessary scheduling and wasted CPU.
Please let me know if my understanding is wrong.
m
Yes, we have optimized these based on real production use cases. There is always a balance, i.e., the broker needs to read metadata from ZK, or cache it, so that is the overhead. But these are optimizations we consider at thousands of QPS and millisecond latency. Is your use case in that range? If not, you might be over-optimizing.
v
OK, fine. For now, we expect only 500 QPS with sub-100 ms latency. We will test and let you know if there are any issues due to overscheduling.
m
Yeah, server-level pruning + partitioning + sorting + inverted index + replica groups will give you much better than that.