harnoor
10/20/2022, 6:19 PM
Most of our queries have a range filter to fetch data for the last ~6 hours, e.g. ( start_time_millis >= 1666256876000 AND start_time_millis < 1666260935000 ), where start_time_millis is the timeColumnName.
We added the star-tree index to improve latency, but we cannot leverage it because the segments are large: max(start_time_millis) - min(start_time_millis) for a segment comes out to be > ~6 hours, and all the segments span roughly ~6 hours of start_time_millis. If we don't add start_time_millis to the dimensionsSplitOrder, the star-tree index doesn't get picked (since in most cases the segment's time range is not a subset of the queried time range). And we cannot add start_time_millis to the dimensionsSplitOrder because of its high cardinality, which would consume a lot of disk space.
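For context, putting the time column into the split order would mean a table config roughly like the following sketch (Pinot's starTreeIndexConfigs format; the other dimension names here are hypothetical placeholders). This is exactly the shape of config that the high cardinality of start_time_millis makes too expensive:

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["start_time_millis", "country", "device"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["COUNT__*"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

Because start_time_millis is the first (and highest-cardinality) dimension in the split order, the tree would branch on nearly every distinct timestamp, which is where the disk-space blowup comes from.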
We are looking to fix this problem. Since we want to leverage the star-tree index, we are considering reducing the number of Kafka partitions in order to reduce segment size; we want each segment to span close to ~1 hour. Our tables have around ~40,050 segments. So I wanted to know whether decreasing the number of partitions is the right path, and what other action items we can perform to solve this problem.

Riley Johnson
10/24/2022, 9:55 PM

Johan Adami
10/24/2022, 9:59 PM

harnoor
10/27/2022, 10:43 AM
DISTINCTCOUNTHLL and PERCENTILE_TDIGEST. Also, the minimum granularity has to be in milliseconds, with exact timestamps. Seems like the star-tree index is the only way forward.
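A star-tree config covering those two aggregations might look like the sketch below (a rough illustration of Pinot's functionColumnPairs syntax; the dimension and metric column names are hypothetical, and the time column is intentionally left out of the split order per the cardinality concern above):

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "device"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "DISTINCTCOUNTHLL__user_id",
          "PERCENTILETDIGEST__latency_ms"
        ],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

With the time column excluded, though, the pre-aggregated values can only be used when a segment's full time range falls inside the query's range, which is the original problem; shrinking segments to ~1 hour is what would make that condition hold for the ~6-hour queries.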