hey we have a use case which consists of events emitted over Apache Pinot #general

hey, we have a use case which consists of events e...

Stuart Coleman

09/09/2022, 4:11 PM

hey, we have a use case which consists of events emitted over time where we have two dimension columns. One is cardinality of approx 10 million and one is of cardinality of approx 100. The business use case is to compute aggregates (min, max, sum, count) against this data filtered by specific values of the cardinality 100 column and either grouped by or selecting a specific value of the order 10 million cardinality column. There is also a filter over time range, so applying the above filters in addition to a filter on the timestamp column which restricts the aggregates to the last day/month/quarter/year. This feels like something that would be well served by the startree index, but i'm struggling to understand how that interacts with the time filtering aspect of the query. The examples in the docs for startree all seem to refer to the whole table without a time filter. Does anyone have any tips on best indexing strategy for this?

Subbu Subramaniam

09/09/2022, 4:43 PM

Not sure why you needed Startree index. It seems to me that keeping the C100 column as a sorted column should be good enough? Are you seeing performance problems? Maybe I am missing something

Mayank

09/09/2022, 6:50 PM

StarTree index helps in case where your queries are really low selectivity (eg > 1M rows being selected) for aggregation. You may start off with sorting on the high cardinality column (since you mentioned the queries will have an equality predicate on this column). Inv index on the other, range index on time if needed. This should already give you great performance

Stuart Coleman

09/10/2022, 5:45 AM

thanks both, that makes sense. i'll try that

Open in Slack

Previous Next