# troubleshooting
d
Our data has more than 400 different dimensions. The cube only has 25 of them, but we are planning to increase that. We are aware that adding a new dimension would, in the worst case, multiply the volume by the cardinality of the new dimension. Is there a recommendation on the number of dimensions too? That is, how many dimensions can I put in the "group by"?
k
It's the product of the cardinalities of the group-by columns. By default we limit that to 100k groups in a group-by query. (Note: this is based on the actual query, not on the actual data.)
You can increase this limit, but that will require you to increase the memory of the server as needed.
400 dimensions is not a problem since the storage is columnar.
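To make the "product of the cardinalities" point concrete, here is a minimal sketch of the worst-case estimate. The column names echo ones mentioned in this thread, but the cardinality numbers and the helper `worst_case_groups` are made up for illustration; this is not Pinot code.

```python
from math import prod

# Hypothetical per-column cardinalities (illustrative numbers only).
cardinalities = {"os_type": 5, "country": 200, "state": 50, "cohort": 12}

# Default group limit mentioned in the thread.
NUM_GROUPS_LIMIT = 100_000


def worst_case_groups(group_by_columns, cardinalities):
    """Upper bound on distinct groups: the product of the column cardinalities."""
    return prod(cardinalities[c] for c in group_by_columns)


small = worst_case_groups(["os_type", "country"], cardinalities)
large = worst_case_groups(["os_type", "country", "state", "cohort"], cardinalities)

print(small, small > NUM_GROUPS_LIMIT)  # 5 * 200 = 1000, under the limit
print(large, large > NUM_GROUPS_LIMIT)  # 5 * 200 * 50 * 12 = 600000, over the limit
```

The real number of groups is usually far below this bound, since not every combination of values occurs in the data.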
d
Thank you, that's helpful. I did see that Pinot supports dictionary encoding, and we have lots of dimensions with low cardinality like os-version, os-type, ips, state, country, segment, cohort, seasonality, etc.
k
yes.
m
Yes, low cardinality columns will compress very well.
k
@Kishore G what happens if the cardinality of a column being used for group by is > 100K? If the query has an order by, will it use a priority queue to keep around the (approximate) top results?
k
yes
k
So unless the data is weirdly skewed, if our LIMIT is something significantly lower (like 1000) then the final results should be exact, or nearly exact.
k
yes
This is really to protect against bad queries that might be run accidentally, like select memberId, sum(views) from T group by memberId where the cardinality of memberId is in the millions.
We still execute the query but only return the top X. Once the priority queue reaches 100k entries, we drop new entries and only keep updating the existing keys.
you can see this in the response stats
numGroupsLimitReached
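The behaviour described above (new keys are dropped once the limit is hit, existing keys keep being updated) can be pictured with a toy model. This is only a sketch under that description, not Pinot's implementation; the function `group_by_sum` and the flag it returns are hypothetical stand-ins for the real executor and the `numGroupsLimitReached` stat.

```python
def group_by_sum(rows, num_groups_limit):
    """Toy group-by: sum values per key, tracking at most num_groups_limit keys."""
    sums = {}
    limit_reached = False
    for key, value in rows:
        if key in sums:
            # Existing keys are always updated.
            sums[key] += value
        elif len(sums) < num_groups_limit:
            # Still room for a new group.
            sums[key] = value
        else:
            # Limit hit: the new key is dropped and the flag is raised.
            limit_reached = True
    return sums, limit_reached


rows = [("a", 1), ("b", 2), ("c", 5), ("a", 3), ("d", 7), ("b", 1)]
sums, reached = group_by_sum(rows, num_groups_limit=3)
print(sums, reached)
```

In this run the key "d" arrives after three groups already exist, so it is dropped, while the later rows for "a" and "b" are still added to their sums.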
k
OK, thanks. What if it’s something like
select memberId, sum(views) from T group by memberId order by sum(views) asc limit 1000
Here the “top” results are ones with the smallest number of views. Will the priority queue use the order by information to correctly keep the memberId groups with the smallest sum?
k
The priority queue is set up based on the order by clause.
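As a rough illustration of a priority queue that follows the order-by direction, here is a small sketch using Python's `heapq`. The helper `top_groups` and the sample data are hypothetical, and Pinot's actual data structure may differ.

```python
import heapq


def top_groups(sums, limit, descending=True):
    """Return up to `limit` (key, sum) pairs, ordered like the ORDER BY direction."""
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(limit, sums.items(), key=lambda kv: kv[1])


# Hypothetical per-member view totals.
views = {"m1": 40, "m2": 5, "m3": 17, "m4": 1}

largest = top_groups(views, 2, descending=True)    # order by sum(views) desc
smallest = top_groups(views, 2, descending=False)  # order by sum(views) asc
print(largest)
print(smallest)
```

With an ascending order by, the groups that are kept are the ones with the smallest sums, which is the case asked about above.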