
João Comini

11/30/2020, 9:08 PM
Hello guys, how are you? I'm having some trouble understanding the results from the RealtimeProvisioningHelper, could you help me? These are my questions:
• Why do we need a numHours parameter? What's the impact of keeping a segment in the consuming state for a certain amount of time (pros/cons)?
• And what does Mapped mean in the Memory used per host result? Is it about the segments on disk?
These are the results that I got:
RealtimeProvisioningHelper -tableConfigFile /tmp/transaction-table.json -numPartitions 20 -pushFrequency null -numHosts 4,8,12,16,20 -numHours 24,48,72,96 -sampleCompletedSegmentDir /tmp/out/transaction_1606528528_1606614928_0 -ingestionRate 4 -maxUsableHostMemory 16G -retentionHours 768

Note:

* Table retention and push frequency ignored for determining retentionHours since it is specified in command
* See <https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime>

Memory used per host (Active/Mapped)

numHosts --> 4               |8               |12              |16              |20              |
numHours
24 --------> 6.8G/71.9G      |3.4G/35.95G     |2.72G/28.76G    |2.04G/21.57G    |1.36G/14.38G    |
48 --------> 7.33G/72.62G    |3.66G/36.31G    |2.93G/29.05G    |2.2G/21.79G     |1.47G/14.52G    |
72 --------> 8.01G/73.11G    |4.01G/36.55G    |3.2G/29.24G     |2.4G/21.93G     |1.6G/14.62G     |
96 --------> 8.39G/74.08G    |4.2G/37.04G     |3.36G/29.63G    |2.52G/22.22G    |1.68G/14.82G    |

Optimal segment size

numHosts --> 4               |8               |12              |16              |20              |
numHours
24 --------> 20.02M          |20.02M          |20.02M          |20.02M          |20.02M          |
48 --------> 40.04M          |40.04M          |40.04M          |40.04M          |40.04M          |
72 --------> 60.05M          |60.05M          |60.05M          |60.05M          |60.05M          |
96 --------> 80.07M          |80.07M          |80.07M          |80.07M          |80.07M          |

Consuming memory

numHosts --> 4               |8               |12              |16              |20              |
numHours
24 --------> 756.05M         |378.02M         |302.42M         |226.81M         |151.21M         |
48 --------> 1.47G           |750.11M         |600.09M         |450.07M         |300.04M         |
72 --------> 2.15G           |1.07G           |878.76M         |659.07M         |439.38M         |
96 --------> 2.92G           |1.46G           |1.17G           |896.57M         |597.71M         |

Total number of segments queried per host (for all partitions)

numHosts --> 4               |8               |12              |16              |20              |
numHours
24 --------> 320             |160             |128             |96              |64              |
48 --------> 160             |80              |64              |48              |32              |
72 --------> 110             |55              |44              |33              |22              |
96 --------> 80              |40              |32              |24              |16              |

Neha Pawar

11/30/2020, 9:46 PM
Hey @João Comini, thanks for sharing so many details

João Comini

11/30/2020, 9:53 PM
Yes, a lot, and my doubts still persist haha

Neha Pawar

11/30/2020, 9:54 PM
numHours indicates the number of hours the realtime segment will be in the CONSUMING state. In this state, all the ingested data is in memory. Periodically, based on thresholds, the data gets converted to a completed segment and flushed to disk. numHours should be set based on a few factors:
1. Retention of your Kafka stream. If your Kafka topic retains data for 24h, then you don't want to set numHours in Pinot to more than 24h. If the pinot-server gets restarted, it has to reconsume everything from the last checkpoint, and Pinot relies on the Kafka stream to still have all that data.
2. You want to keep numHours reasonably low. The more the segment consumes, the bigger the segment it needs to create, and segment creation is memory-intensive. In case of pinot-server restarts, the server has to reconsume everything, so again, a reasonably low numHours is desired.
Typically, we don't recommend increasing this beyond 24h.
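The two constraints above can be sketched as a small filter over candidate numHours values. This is only an illustration: the helper name `usable_num_hours` is hypothetical, it is not part of Pinot or the RealtimeProvisioningHelper, and the 24h ceiling is the rule of thumb from this thread rather than a limit Pinot enforces.

```python
def usable_num_hours(candidates, kafka_retention_hours, recommended_max_hours=24):
    """Keep only numHours candidates that satisfy both constraints from the thread.

    Hypothetical helper for illustration; not part of Pinot.
    """
    # A consuming segment must never outlive the Kafka retention window,
    # and the thread advises staying at or below roughly 24 hours.
    limit = min(kafka_retention_hours, recommended_max_hours)
    return [hours for hours in candidates if hours <= limit]


if __name__ == "__main__":
    # The -numHours values from the command above, with an assumed 24h Kafka retention.
    print(usable_num_hours([24, 48, 72, 96], kafka_retention_hours=24))
```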

João Comini

11/30/2020, 9:57 PM
Oh, right, got it! What about the number of segments queried? If numHours is low, I will have a lot more segments, right?

Neha Pawar

11/30/2020, 9:58 PM
btw, if you’re fairly new to the realtime segment concepts, this video might help in making sense of some of these terms

https://youtu.be/WoruCQgPhSA


João Comini

11/30/2020, 9:58 PM
Thanks! I'll take a look :)

Subbu Subramaniam

11/30/2020, 9:59 PM
@João Comini all very good questions. Rows are stored in uncompressed format while the segment is consuming, but are compressed after it is completed. So, if you consume for a longer time, you use more volatile memory. Also, like Neha mentioned, if you need to restart the server, the rows are consumed from the start of the segment again.

Neha Pawar

11/30/2020, 9:59 PM
yes that is true. But looks like the max in your case is 320?

Subbu Subramaniam

11/30/2020, 10:00 PM
On the other hand, if you set numHours too low, then as you pointed out, you get too many segments. That can be bad for query processing, especially in high-QPS use cases.
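The segment counts in the "Total number of segments queried per host" table posted earlier can be reproduced with simple arithmetic. This is a sketch and not the helper's actual code: the formula is inferred from the posted numbers, and the replication factor of 2 is an assumption (it would come from the table config, which is not shown in the thread).

```python
import math


def segments_per_host(num_partitions, replication, num_hosts, retention_hours, num_hours):
    """Estimate completed segments per host (formula inferred from the posted table)."""
    # Partition replicas are spread across hosts; round up for the busiest host.
    partitions_per_host = math.ceil(num_partitions * replication / num_hosts)
    # Each partition completes one segment every num_hours within the retention window.
    segments_per_partition = math.ceil(retention_hours / num_hours)
    return partitions_per_host * segments_per_partition


if __name__ == "__main__":
    # 20 partitions and 768 retention hours come from the command; replication=2 is assumed.
    for num_hours in (24, 48, 72, 96):
        row = [segments_per_host(20, 2, hosts, 768, num_hours) for hosts in (4, 8, 12, 16, 20)]
        print(num_hours, row)
```

Under these assumptions the function matches the posted cells, for example 320 for 4 hosts at 24 hours and 16 for 20 hosts at 96 hours. It also makes the trade-off visible: a lower numHours raises the second factor, while fewer partitions lower the first.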

Neha Pawar

11/30/2020, 10:00 PM
which is quite small

Subbu Subramaniam

11/30/2020, 10:04 PM
All the segments still within the retention period are in memory (mapped), as are the consuming segments. That is the total mapped memory. The active memory is estimated as the most recent 768 hours of data (as specified by you on the command line).

João Comini

11/30/2020, 10:05 PM
Nice! Thank you guys.
Oh, I see. Right, I'll watch Neha's video and see if I have more questions.

Neha Pawar

11/30/2020, 10:14 PM
that video is only for beginner concepts about realtime consumption and segments. Does not cover the provisioning helper, but please watch it regardless hah 🙂
Also unrelated, if your ingestion rate is only 4, do you really need 20 partitions?
the partitioning factor can also help with concerns about too many segments

João Comini

11/30/2020, 10:17 PM
Yes, this is one of my concerns. The company where I work is huge, and we kind of need this solution working quickly. The configuration of this topic is not under my team's control, so we would need a middleman (Flink, Heron, etc.), but we don't have that much time hahaha
We could also create a simple Java application that does the work of moving data from one topic to another.
Oh, one more thing: by mapped you mean that the segments aren't in memory, right? The segments are on the server's disk and mapped into memory, or am I missing something?
I'm asking this because I want to know how much disk space I need for my servers.

Subbu Subramaniam

11/30/2020, 10:28 PM
Yes, "mapped" means that the files on disk are mapped into memory using mmap.
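To illustrate what "mapped" means here, a small sketch using Python's standard-library `mmap` module. It is only an analogy for how a completed segment file stays on disk while being addressable like memory; the file contents are made up, and this is not how Pinot itself is implemented.

```python
import mmap
import os
import tempfile


def read_mapped(path, start, length):
    """Map a file read-only and return a slice without reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # The OS pages data in from disk only when it is touched.
            return mapped[start:start + length]


if __name__ == "__main__":
    # Stand-in for a completed segment file (hypothetical content).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"completed-segment-bytes")
    try:
        print(read_mapped(tmp.name, 0, 9))
    finally:
        os.remove(tmp.name)
```

Because mapped bytes are backed by files, the data lives on disk and the operating system decides which pages are resident in RAM at any moment.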

João Comini

11/30/2020, 10:30 PM
Right, now I feel much more comfortable haha
Last one (really, I promise): what about resource requests and limits in Kubernetes? If the active memory + consuming memory use is about 8G, how much extra memory would I need for the off-heap computations?
Should I ask Kubernetes for 20G and set Xms and Xmx to 16G for safety? (Just a hypothetical example)

Neha Pawar

11/30/2020, 11:01 PM
good question. @Xiang Fu any recommendations based on this ^^

Xiang Fu

11/30/2020, 11:06 PM
I would put 4 CPU (increase this if you have a high-QPS use case) and 32 GB RAM for request/limit, and set both -Xms and -Xmx to 16g

Neha Pawar

11/30/2020, 11:09 PM
curious on how you came up with these @Xiang Fu ?

João Comini

11/30/2020, 11:09 PM
Me too haha

Xiang Fu

11/30/2020, 11:16 PM
that’s the t3.2xlarge machine size
8cpu,32gb ram
in general I recommend containers with more ram
so I put the ratio to 4cpu/32gb ram
if you can pick memory optimized sku like r5.xlarge, that’s the exact fit

João Comini

11/30/2020, 11:21 PM
hmmm, that's something that I'll need to discuss here, our nodes are all m5.xlarge

Xiang Fu

11/30/2020, 11:43 PM
hmm, then I would just do 4 CPU/16 GB RAM for the container and set both -Xmx and -Xms to 8 GB.
also I would suggest using bigger machines like 2xlarge or 4xlarge
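Both suggestions in this thread give the JVM heap half of the container memory (32 GB with a 16 GB heap, 16 GB with an 8 GB heap), leaving the rest for off-heap and memory-mapped data. A minimal sketch of that rule of thumb, assuming the 50% ratio generalizes; the function name and the returned dictionary layout are hypothetical:

```python
def pinot_server_sizing(container_memory_gb, cpus=4, heap_fraction=0.5):
    """Derive JVM heap flags from container memory (rule of thumb from the thread)."""
    heap_gb = int(container_memory_gb * heap_fraction)
    return {
        "cpu": cpus,
        "memory_gb": container_memory_gb,
        # Equal -Xms and -Xmx, as suggested in the thread.
        "jvm_opts": f"-Xms{heap_gb}G -Xmx{heap_gb}G",
    }


if __name__ == "__main__":
    print(pinot_server_sizing(16))  # the m5.xlarge case
    print(pinot_server_sizing(32))  # the r5.xlarge case
```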

João Comini

12/01/2020, 1:25 PM
All right. Thank you so much guys! You're awesome! ❤️
I'm back hehe. What about offline server SKUs, are there any recommendations? What's the priority in this case?

Xiang Fu

12/01/2020, 7:06 PM
Typically we recommend memory-optimized SKUs. Offline servers typically have less CPU pressure compared to realtime servers.
We also recommend enabling tag-based tiers so that Pinot will move persisted segments from realtime servers to offline servers. See: https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime#moving-completed-segments-to-different-hosts and https://docs.pinot.apache.org/operators/operating-pinot/tiered-storage

João Comini

12/09/2020, 1:51 PM
@Rhafik Gonzalez