# troubleshooting
h
Hi experts, I need some help. Our Pinot queries are aggregation-heavy, and I have observed that a lot of them are quite slow. All of the queries have a range filter like:
( start_time_millis >= 1666256876000 AND start_time_millis < 1666260935000 )
where start_time_millis is the timeColumnName. Most of the queries use a range filter to fetch less than the last 6 hours of data. We added a star-tree index to improve latency, but we cannot leverage it because the segments are too big: max(start_time_millis) - min(start_time_millis) for a segment comes out to more than ~6 hours, and all the segments span around 6 hours of start_time_millis. If we don't add start_time_millis to the dimension split order, the star-tree index doesn't get picked (in most cases the segment's time range is not a subset of the queried time range). And we cannot add start_time_millis to the dimension split order because of its high cardinality, which consumes a lot of disk space.
We are looking to fix this problem. We want to leverage the star-tree index, so we are looking to reduce the number of Kafka partitions in order to reduce segment size. We want each segment to cover close to ~1 hour. Our tables have around 40050 segments. Is decreasing the number of partitions the right path, and what other action items can we take to solve this problem?
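To make the subset condition concrete, here is a minimal sketch of the reasoning in this message. It is not Pinot's planner code: the function name and the two example segments are hypothetical, and only the query bounds come from the filter quoted earlier.

```python
# Illustration of the condition described in the message, not Pinot internals.

HOUR_MS = 3_600_000
MINUTE_MS = 60_000


def segment_fully_covered(seg_min_ms, seg_max_ms, query_start_ms, query_end_ms):
    """Return True when every row in the segment satisfies
    query_start_ms <= start_time_millis < query_end_ms."""
    return query_start_ms <= seg_min_ms and seg_max_ms < query_end_ms


# Bounds taken from the example filter in the message.
query_start_ms = 1666256876000
query_end_ms = 1666260935000
query_window_ms = query_end_ms - query_start_ms  # 4_059_000 ms, about 67.65 minutes

# Hypothetical segment spanning 6 hours around the query window.
six_hour_segment = (query_start_ms - 2 * HOUR_MS, query_start_ms + 4 * HOUR_MS)

# Hypothetical 30-minute segment that falls entirely inside the query window.
short_segment = (query_start_ms + 10 * MINUTE_MS, query_start_ms + 40 * MINUTE_MS)

six_hour_covered = segment_fully_covered(*six_hour_segment, query_start_ms, query_end_ms)
short_covered = segment_fully_covered(*short_segment, query_start_ms, query_end_ms)

print(query_window_ms / MINUTE_MS)  # window length in minutes
print(six_hour_covered, short_covered)
```

A segment that spans about 6 hours can never sit inside a window shorter than 6 hours, so the time filter always has to be evaluated on it. Shorter segments only help for the ones that happen to land fully inside the queried range.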
r
Hi Harnoor, have you considered stream pre-processing for this? It may make more sense to perform the aggregations before ingesting the data into Pinot.
j
You can also do the aggregations in Pinot (https://docs.pinot.apache.org/developers/advanced/ingestion-level-aggregations), but you will need to give up query flexibility and pick a minimum granularity for your time column.
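As a rough sketch of what picking a minimum granularity means for the time column: each timestamp is truncated to the start of its bucket, so many distinct millisecond values collapse into a few bucket values. The one-hour bucket and the helper name below are assumptions for illustration, not something Pinot prescribes.

```python
# Sketch: truncating epoch-millisecond timestamps to a coarser bucket.

HOUR_MS = 3_600_000


def truncate_to_bucket(epoch_ms, bucket_ms):
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    return epoch_ms - (epoch_ms % bucket_ms)


# One timestamp per second across the example query window from the thread.
query_start_ms = 1666256876000
query_end_ms = 1666260935000
raw_timestamps = list(range(query_start_ms, query_end_ms, 1000))

hourly_buckets = {truncate_to_bucket(ts, HOUR_MS) for ts in raw_timestamps}

print(len(set(raw_timestamps)))  # distinct raw values
print(sorted(hourly_buckets))    # distinct hourly bucket values
```

The trade-off is the one mentioned above: once values are bucketed, queries can no longer filter at a finer resolution than the bucket size.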
h
Thanks @Riley Johnson @Johan Adami. Unfortunately, neither ingestion aggregation nor stream pre-processing supports DISTINCTCOUNTHLL and PERCENTILE_TDIGEST. Also, our minimum granularity has to be milliseconds, with exact timestamps. It seems the star-tree index is the only way forward.