harnoor
10/20/2022, 6:19 PM
Most of our queries have a range filter to fetch data for the last ~6 hours, e.g. ( start_time_millis >= 1666256876000 AND start_time_millis < 1666260935000 ), where start_time_millis is the timeColumnName.
We added the star-tree index to improve latency, but we cannot leverage it because the segments are large: max(start_time_millis) - min(start_time_millis) for a segment comes out to be > ~6 hours, and all the segments span roughly ~6 hours of start_time_millis. If we don't add start_time_millis to the dimensionsSplitOrder, the star-tree index doesn't get picked (since in most cases the segment's time range is not a subset of the queried time range). And we cannot add start_time_millis to the dimensionsSplitOrder because of its high cardinality, which would consume a lot of disk space.
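For context, putting the time column into the split order would mean a table config roughly like the following sketch (Pinot's starTreeIndexConfigs format; the other dimension names here are hypothetical placeholders). This is exactly the shape of config that the high cardinality of start_time_millis makes too expensive:

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["start_time_millis", "country", "device"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["COUNT__*"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

Because start_time_millis is the first (and highest-cardinality) dimension in the split order, the tree would branch on nearly every distinct timestamp, which is where the disk-space blowup comes from.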
We are looking to fix this problem. Since we want to leverage the star-tree index, we are considering reducing the number of Kafka partitions in order to reduce segment size; we want each segment to span close to ~1 hour. Our tables have around ~40,050 segments. So I wanted to know whether decreasing the number of partitions is the right path, and what other action items we can perform to solve this problem.

Riley Johnson
10/24/2022, 9:55 PM

Johan Adami
10/24/2022, 9:59 PM

harnoor
10/27/2022, 10:43 AM
DISTINCTCOUNTHLL and PERCENTILE_TDIGEST. Also, the minimum granularity has to be in milliseconds, with exact timestamps. Seems like the star-tree index is the only way forward.
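A star-tree config covering those two aggregations might look like the sketch below (a rough illustration of Pinot's functionColumnPairs syntax; the dimension and metric column names are hypothetical, and the time column is intentionally left out of the split order per the cardinality concern above):

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "device"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "DISTINCTCOUNTHLL__user_id",
          "PERCENTILETDIGEST__latency_ms"
        ],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

With the time column excluded, though, the pre-aggregated values can only be used when a segment's full time range falls inside the query's range, which is the original problem; shrinking segments to ~1 hour is what would make that condition hold for the ~6-hour queries.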