Hi Everyone, I have a question about performing l...
# pinot-perf-tuning
b
Hi Everyone, I have a question about performing lots of calculations. We have a few queries where we need to calculate the sum of each column after applying a series of where conditions. (10k+ columns with Boolean values) When we have many AND conditions the query is quite fast, as we collect the data and retrieve all the sums on a relatively flat dataset. However when we have a very limited number of AND conditions the query slows way down, as for each column we need to calculate the SUM for millions of rows. This has a propensity to take a very long time or worse, throw an out of memory error for the java heap space. I don't think that pre-aggregations will work here as they are for an entire column and not an entire column with a condition. (please correct me if I'm wrong) Furthermore, our data filtering and retrieval seems to be quite fast, it doesn't feel like it's an issue of retrieving the data but more aggregating the data. Any thoughts on increasing performance? Any ideas or thoughts are welcome.
k
did you try startree index?
b
because of the rather large size of the tables, we routinely are running out of resources when applying startree indexing on an EMR cluster. If that would likely solve the problem, then an alternative line of solutioning would be optimizing our startree indexed segment generation on EMR.
k
yeah strongly suggest trying startree index
b
awesome, appreciate the feedback! Just curious, if we are unable to produce a startree index -- are there any other pieces of advice that you can provide? for example would horizontally or vertically scaling our server pods be a good solution for performing a large number of SUM operations? Should we have a fewer number of larger segments? Or would a large number of small segments serve our purposes better? Any and all feedback is welcome
k
yes.. scaling will definitely help. If you can share the query and response metadata, we might be able to suggest a more concrete solution
b
The Query: SUM(boolean_field), SUM(boolean_field), ... 15k SUM(boolean_field) FROM table WHERE boolean_field = 1 or boolean_field = 0 and (boolean_field = 1 and boolean_field = 1) As for response metadata there is very little aside from thimeUsedMs and totalDocs -- is there a flag I can pass in with the API call to gather more stats?
Also, just from the perspective of scaling - given the nature of the dataset and the query, which kind of scaling will help us more in the near-term? Horizontal vs. Vertical?