Our data has more than 400 different dimensions Cube only ha Apache Pinot #troubleshooting

dhurandar

12/18/2020, 5:51 PM

Our data has more than 400 different dimensions. The cube only has 25 of them, but we are planning to increase that. We are aware that adding a new dimension would, in the worst case, multiply the volume by the cardinality of the new dimension. Is there a recommendation on the number of dimensions too? As in, how many dimensions can I add to the "group by"?

Kishore G

12/18/2020, 5:55 PM

It's the product of the cardinalities of the group-by columns. By default we limit that to 100k in a group-by query. (Note this applies to the actual query, not to the actual data.)
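
To make the "cardinality product" concrete, here is a minimal sketch of the worst-case arithmetic. The column names and cardinalities are hypothetical, and `NUM_GROUPS_LIMIT` simply mirrors the 100k default mentioned above; it is not read from any Pinot config.

```python
from math import prod

# Assumed default from the thread: 100k groups per group-by query.
NUM_GROUPS_LIMIT = 100_000


def worst_case_groups(cardinalities):
    """Upper bound on distinct groups: the product of per-column cardinalities."""
    return prod(cardinalities.values())


# Hypothetical cardinalities for the group-by columns of one query.
columns = {"os_type": 5, "country": 200, "cohort": 50}
groups = worst_case_groups(columns)
print(groups, groups > NUM_GROUPS_LIMIT)

# Adding one more hypothetical column multiplies the bound by its cardinality.
columns["state"] = 60
groups = worst_case_groups(columns)
print(groups, groups > NUM_GROUPS_LIMIT)
```

The bound is a worst case: real data rarely contains every combination, so the actual number of groups is usually much lower.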

Kishore G

12/18/2020, 5:55 PM

You can increase this limit, but that will require you to increase the server's memory as needed.

Kishore G

12/18/2020, 5:56 PM

400 dimensions is not a problem since it's columnar.

dhurandar

12/18/2020, 5:57 PM

Thank you, that's helpful. I did see that Pinot supports dictionary encoding, and we have lots of dimensions with low cardinality like os-version, os-type, ips, state, country, segment, cohort, seasonality, etc.

Kishore G

12/18/2020, 5:58 PM

yes.

Mayank

12/18/2020, 6:00 PM

Yes, low cardinality columns will compress very well.
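
A rough illustration of why low-cardinality columns compress well under dictionary encoding: each value is replaced by a small integer id, and the id width depends only on the number of distinct values. This is a generic sketch with made-up sample data, not Pinot's actual storage code.

```python
from math import ceil, log2


def dictionary_encode(values):
    """Return (sorted dictionary, list of integer ids) for a column."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def bits_per_value(cardinality):
    """Bits needed for a fixed-width id given `cardinality` distinct values."""
    return max(1, ceil(log2(cardinality)))


# Hypothetical os-type column with three distinct values.
os_type = ["android", "ios", "ios", "web", "android", "ios"]
dictionary, ids = dictionary_encode(os_type)
print(dictionary)  # ['android', 'ios', 'web']
print(ids)         # [0, 1, 1, 2, 0, 1]
print(bits_per_value(len(dictionary)))  # 2
```

With three distinct values, each row needs 2 bits instead of a full string, and the strings themselves are stored once in the dictionary.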

Ken Krugler

12/18/2020, 7:52 PM

@Kishore G what happens if the cardinality of a column being used for group by is > 100K? If the query has an order by, will it use a priority queue to keep around the (approximate) top results?

Kishore G

12/18/2020, 7:52 PM

yes

Ken Krugler

12/18/2020, 7:54 PM

So unless the data is weirdly skewed, if our LIMIT is something significantly lower (like 1000) then the final results should be exact, or nearly exact.

Kishore G

12/18/2020, 7:55 PM

yes

Kishore G

12/18/2020, 7:56 PM

This is really to protect against bad queries that might be run accidentally, like select memberId, sum(views) from T group by memberId, where the cardinality of memberId is in the millions.

Kishore G

12/18/2020, 7:57 PM

We still execute the query but only return the top X. Once the priority queue reaches 100k, we drop new entries and only keep updating the existing keys.
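
The behavior described here (stop admitting new groups at the limit, keep updating existing ones) can be sketched as follows. This is a simplified single-threaded model with a hypothetical `group_by_sum` helper, not Pinot's implementation. Setting the flag when a key is first dropped is an assumption of this sketch.

```python
def group_by_sum(rows, num_groups_limit):
    """Sum values per key, refusing to create new groups past the limit."""
    sums = {}
    limit_reached = False
    for key, value in rows:
        if key in sums:
            # Existing groups keep being updated even after the limit is hit.
            sums[key] += value
        elif len(sums) < num_groups_limit:
            sums[key] = value
        else:
            # New key arrives after the limit: drop it and remember that we did.
            limit_reached = True
    return sums, limit_reached


# Tiny limit of 2 so the effect is visible; the default in the thread is 100k.
rows = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("c", 5)]
sums, limit_reached = group_by_sum(rows, num_groups_limit=2)
print(sums, limit_reached)  # {'a': 5, 'b': 2} True
```

Note that which keys survive depends on arrival order, which is why results are only approximate once the limit is hit.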

Kishore G

12/18/2020, 7:57 PM

you can see this in the response stats

Kishore G

12/18/2020, 7:58 PM

numGroupsLimitReached
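
A minimal sketch of checking that stat from client code, assuming the query response is JSON and exposes `numGroupsLimitReached` as a top-level boolean. The sample body is hypothetical and heavily trimmed, and `groups_limit_reached` is an illustrative helper, not a Pinot client API.

```python
import json


def groups_limit_reached(response_text):
    """Return True if the parsed response reports numGroupsLimitReached."""
    response = json.loads(response_text)
    return bool(response.get("numGroupsLimitReached", False))


# Hypothetical response body; a real one carries many more fields.
sample = '{"numGroupsLimitReached": true}'
print(groups_limit_reached(sample))  # True
```

When the flag is true, the aggregates for the returned groups may be incomplete, so it is worth logging or surfacing to the caller.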

Ken Krugler

12/18/2020, 8:00 PM

OK, thanks. What if it’s something like

select memberId, sum(views) from T group by memberId order by sum(views) asc limit 1000

Here the “top” results are ones with the smallest number of views. Will the priority queue use the order by information to correctly keep the memberId groups with the smallest sum?

Kishore G

12/18/2020, 8:01 PM

The priority queue is set up based on the order by clause.
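
As a rough illustration of an order-aware priority queue, the sketch below keeps the top `limit` groups in whichever direction the order by clause asks for. It uses Python's `heapq` on already-aggregated sums; the member ids and values are made up, and `top_groups` is a hypothetical helper rather than anything in Pinot.

```python
import heapq


def top_groups(sums, limit, descending):
    """Keep `limit` groups ordered by aggregated value, in the requested direction."""
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(limit, sums.items(), key=lambda item: item[1])


# Hypothetical per-member view sums.
sums = {"m1": 40, "m2": 5, "m3": 17, "m4": 90}
print(top_groups(sums, 2, descending=True))   # [('m4', 90), ('m1', 40)]
print(top_groups(sums, 2, descending=False))  # [('m2', 5), ('m3', 17)]
```

An ascending order by therefore keeps the groups with the smallest sums, and a descending one keeps the largest.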
