Tanmay Movva

11/16/2020, 8:08 AM
Hello, I was using the realtime provisioning tool with the following sample data
Size of segment directory = 239.4mb
Number of documents = 3529197
retention = 30 days
ingestion rate = 1000
numPartitions = 1
Table Replicas = 1
I am running this by building pinot from source and the results are quite surprising.
The 81g of server memory per host looked suspicious, and on looking into the code I observed that it assumes the segment directories are held entirely in memory. That is:
active memory per host = Number of active(consuming + completed/retained) segments * Size of the segment directory.
From what I understand, these segment directories should sit on the server's disk, with the indexes mmapped into memory, so that segments can be paged in on demand at query time. Please correct me if I am wrong.
Can anyone please help me out here? Let me know if I have to raise an issue on github for this. Thanks!!
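
For readers following along, the formula quoted above can be written out as a small calculation. This is only a sketch of the arithmetic as described in the message, not the provisioning tool's actual code. The function names and the segment count of 100 are hypothetical, and the units are assumed to be binary (1 GB = 1024 MB).

```python
MB = 1024 ** 2
GB = 1024 ** 3


def active_memory_bytes(num_active_segments: int, segment_dir_bytes: float) -> float:
    """Active memory per host as described in the thread:
    number of active (consuming + retained) segments * segment directory size."""
    return num_active_segments * segment_dir_bytes


def segments_for_memory(memory_bytes: float, segment_dir_bytes: float) -> float:
    """Inverse of the formula: how many segments a given memory figure implies."""
    return memory_bytes / segment_dir_bytes


# Segment directory size from the sample data in the thread.
segment_dir_bytes = 239.4 * MB

# Hypothetical segment count, chosen only to illustrate the multiplication.
example_total = active_memory_bytes(100, segment_dir_bytes)
print(f"100 segments -> {example_total / GB:.1f} GB")

# Under the same formula, the reported 81g would correspond to this many segments.
implied_segments = segments_for_memory(81 * GB, segment_dir_bytes)
print(f"81 GB -> about {implied_segments:.0f} segments")
```

Read this way, the memory figure grows linearly with the number of retained segments, which is why a 30 day retention produces a large number even when each segment is small.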

Kishore G

11/16/2020, 3:34 PM
That formula does not seem right @Neha Pawar ^^

Subbu Subramaniam

11/16/2020, 4:53 PM
The formula is correct. Pinot maps all segments into memory. Since you have 30d retention, the code expects that at any time, all of the 30d will be used for queries. If this is not the case, you can choose to reduce the ratio of mapped vs active memory by increasing the total amount of memory available, in the command line argument. (Btw, it helps if you cut/paste the entire output.)
The more paging you have, the higher the latency is likely to be. Consuming segments are read-write memory, so mapping them onto disk can cause paging when rows are ingested, since writes happen. To improve performance, you may want to map the memory for consuming segments on a tmpfs file system. Completed segments do not suffer writes, so you can tune this by mapping as much memory as you like, as long as it does not affect your latency.

Tanmay Movva

11/17/2020, 2:51 AM
> If this is not the case, you can choose to reduce the ratio of mapped vs active memory by increasing the total amount of memory available
By this, do you mean the maxUsableHostMemory argument? I've already set this to 90g, because anything less than 82g (the required server host memory as per my output is 81g) doesn't give me any output (kind of obvious from the code). If not this, where can I set the ratio or read about it?
> The more paging you have, the latency is likely to be higher. Consuming segments are read-write memory, so mapping them on to disk can cause paging when rows are ingested since writes happen
I get this part. I have no issue with consuming segments or realtime servers. I wanted to get an idea of how I should provision my offline server (which would host completed segments) so that segments are on disk with mmap and are paged in when required.
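
As a generic illustration of the "on disk with mmap, paged in when required" behaviour discussed here, the sketch below maps a file read-only and touches only a small slice of it. It uses Python's standard `mmap` module on a temporary demo file. It is not Pinot's implementation, and the helper name `read_slice_via_mmap` and the file contents are made up for the example.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map the whole file read-only and return one slice of it.

    Mapping reserves address space; the operating system loads pages from
    disk only when the corresponding bytes are actually accessed.
    """
    with open(path, "rb") as f:
        # Length 0 maps the entire file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Build a small stand-in for a completed (immutable) segment file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * 4096 + b"PINOT" + b"B" * 4096)
    demo_path = tmp.name

# Only the pages backing this slice need to be resident to serve the read.
chunk = read_slice_via_mmap(demo_path, 4096, 5)
print(chunk)

os.remove(demo_path)
```

The point of the sketch is that the mapped size and the resident size are different quantities: how much physical memory a read-only mapping needs depends on which pages queries actually touch.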