Hi Team, I have some questions regarding the quot...
# general
m
Hi Team, I have some questions regarding the quota-related configuration of a dimension table. We are running a Pinot cluster with 4 servers, and we have created a dimension table with the storage quota configured as "200 MB". We are populating this dimension table with segments of 100k records each; the uncompressed size of each segment is ~23 MB (232 bytes per record). By that calculation, we expected the table to hold ~900k records. However, we see only 200k records in the table, and when we push a 3rd segment of 100k records we get a 403 error. Can someone please help us understand why we are getting the 403 error and cannot push any more 100k-record segments?
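For reference, the expectation in the question can be reproduced with a few lines of arithmetic. This is only a sketch of the poster's estimate, not of anything Pinot computes; the function name is made up for illustration, and whether "200 MB" means 200 * 1024 * 1024 bytes or 200 * 1000 * 1000 bytes is an assumption, so both are shown.

```python
def expected_capacity(quota_bytes: int, record_bytes: int, records_per_segment: int):
    """Return (max records, max whole segments) that fit under a storage quota.

    Ignores replication entirely: this is the naive estimate from the question.
    """
    segment_bytes = record_bytes * records_per_segment
    return quota_bytes // record_bytes, quota_bytes // segment_bytes


# Binary megabytes (assumption): 200 MB = 200 * 1024 * 1024 bytes
binary = expected_capacity(200 * 1024 * 1024, 232, 100_000)
# Decimal megabytes (assumption): 200 MB = 200 * 1000 * 1000 bytes
decimal = expected_capacity(200 * 1000 * 1000, 232, 100_000)

print(binary)   # (903944, 9)
print(decimal)  # (862068, 8)
```

Either way, the naive estimate allows 8 or 9 segments of 100k records, well above the 2 segments that were actually accepted.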
m
Which command are you running that returns the 403 error?
m
Probably the offline segment push.
m
ah
m
We have defined a batch ingestion job that pushes a CSV of 100k records to the offline table. We get a 403 error with a message saying that the total size exceeds the quota size.
m
Can you check the dim table size using the rest api? And also the uncompressed segment size you are trying to push?
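A minimal sketch of checking the table size over REST, assuming the controller exposes `GET /tables/{tableName}/size` and is reachable at a hypothetical `http://localhost:9000`. The table name `dimTable` and the helper names are placeholders, and the response field read in the usage comment should be checked against the actual response of your Pinot version.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def table_size_url(controller_url: str, table_name: str) -> str:
    """Build the table-size URL for a controller (assumed endpoint: /tables/{name}/size)."""
    return f"{controller_url.rstrip('/')}/tables/{quote(table_name, safe='')}/size"


def fetch_table_size(controller_url: str, table_name: str, timeout: float = 10.0) -> dict:
    """Call the controller and return the parsed JSON body as a dict."""
    with urlopen(table_size_url(controller_url, table_name), timeout=timeout) as resp:
        return json.load(resp)


def bytes_to_mb(num_bytes: int) -> float:
    """Convert bytes to binary megabytes for easier comparison with the quota."""
    return num_bytes / (1024 * 1024)


# Usage (requires a running controller; host and table name are hypothetical):
#   size = fetch_table_size("http://localhost:9000", "dimTable")
#   reported = size.get("reportedSizeInBytes")  # field name is an assumption
#   if reported is not None:
#       print(f"{bytes_to_mb(reported):.1f} MB")
```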
m
The size of the dimension table is 194 MB (it has 2 segments of 100k records each). The uncompressed size of the segment we are pushing is 23 MB.
To explore this further, I created one segment of 860k records (200 MB / 232 bytes per record ~ 860k) and pushed it to a different table that previously had zero segments. Pinot processed this segment without any errors. So when I push segments of 100k records, Pinot fails to accept more than 2 segments, yet a single segment of 860k records was accepted.
What should the replication value be in the table configuration of a dimension table if we are running a cluster with 4 servers? Should it be 1 or 4?
m
I don’t think replication is needed for this one
The table size api gives the right size
If that is 194MB then you are close to the quota
m
Yeah, 194 MB is close to the quota. However, when I pushed the segment of 860k records to a different, new table, the size of that table came out as 832 MB, despite the quota being set to 200 MB.
On checking the code, this is what I could understand:
1. We get the replication number from the table configuration. Since the replication number in our case is 1, the allowedSize is 200 MB * 1 (number of replicas) = 200 MB (Code Link)
2. Whenever a new segment is pushed, Pinot calculates the current size by summing the sizes of the uncompressed segments present on all the replicas. Since we are running a 4-server cluster, I think the total size is computed in this fashion:
23 MB (size of each segment) * 2 (number of segments already present) * 4 (since we have 4 nodes and the dimension table is copied to all of them)
= 23 * 2 * 4 = 184 MB
Size of the new incoming uncompressed segment = 23 MB,
so the new estimated size = 184 + 23 = 207 MB. This is greater than 200 MB, which is why Pinot is not accepting any more segments of 100k records.
So I think that, since the code fetches the replication number from the table configuration, we should set the replication value to 4 and not 1, as the table is already replicated on all the hosts?
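The reasoning above can be sketched as a small simulation. This models only the poster's reading of the quota check (allowed size = quota * configured replication, current size = sum over every server holding a copy); it is not Pinot's actual code, and the function and parameter names are made up for illustration.

```python
def can_push(quota_mb: float, replication: int, segment_mb: float,
             existing_segments: int, servers: int):
    """Model the quota check as described in the thread.

    Returns (accepted, estimated_mb, allowed_mb).
    """
    allowed_mb = quota_mb * replication
    # A dimension table is copied to every server, so each segment is counted once per server.
    existing_mb = segment_mb * existing_segments * servers
    estimated_mb = existing_mb + segment_mb
    return estimated_mb <= allowed_mb, estimated_mb, allowed_mb


# Replication configured as 1 on a 4-server cluster:
print(can_push(200, 1, 23, existing_segments=1, servers=4))  # (True, 115, 200)
print(can_push(200, 1, 23, existing_segments=2, servers=4))  # (False, 207, 200)

# Replication configured as 4 on the same cluster:
print(can_push(200, 4, 23, existing_segments=2, servers=4))  # (True, 207, 800)
```

Under this model the second 100k-record segment is accepted and the third is rejected when replication is 1, which matches the behaviour reported in the thread.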
m
I see, if so that seems like a bug. Could you please file a GH issue?
m
Sure @User, I can file a GitHub issue. Before doing that, let me summarise my understanding so that I can add the relevant details to the issue:
1. The quota configuration is at the table level, so if a table is configured with a size of "200MB", that table should be able to store 200 MB of data.
2. Since dimension tables are replicated across all the servers, the replication factor should be set accordingly in the table configuration. If we are running a 4 cluster setup, the replication factor should be set to 4 so that the calculations are right.
Is this understanding right?
m
What do you mean by 4 cluster setup? Do you mean 4 Pinot servers?
I’d say the question is if dimension table should honor replication at all or not (for quota), because it does not seem to apply to it
m
Yes, I mean 4 servers in the cluster.
m
Thanks