# getting-started
m
Hi all, can anyone explain how to use 'RealtimeProvisioningHelper'? When I run it with a completed segment I get 'java.lang.OutOfMemoryError: Direct buffer memory' every time. The command I used: `sh pinot-admin.sh RealtimeProvisioningHelper -tableConfigFile=/home/mamlesh/pinot/ingestData/massRealtimeTable.json -numPartitions=1 -pushFrequency=Append -numHosts=1,2,3 -numHours=1,2,3,4,5,6,7,8,9,10,11,12 -sampleCompletedSegmentDir=/disk1/pinotData/server/index/MassDataTableIR_REALTIME/MassDataTableIR__8__3__20221012T1842Z -ingestionRate=500 -maxUsableHostMemory=18G -retentionHours=168`. Is there something wrong in the command? I was not able to find documentation for 'RealtimeProvisioningHelper'.
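One likely cause of that error: the tool loads the sample completed segment into direct (off-heap) buffers, so the JVM's default direct-memory cap can be too small. A sketch of one way to raise it, assuming this build of `pinot-admin.sh` honors the `JAVA_OPTS` environment variable (recent Pinot releases do); the sizes are illustrative, not recommendations:

```shell
# Assumption: pinot-admin.sh reads JAVA_OPTS; adjust sizes to your host.
export JAVA_OPTS="-Xms4G -Xmx8G -XX:MaxDirectMemorySize=16G"
sh pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile=/home/mamlesh/pinot/ingestData/massRealtimeTable.json \
  -numPartitions=1 \
  -sampleCompletedSegmentDir=/disk1/pinotData/server/index/MassDataTableIR_REALTIME/MassDataTableIR__8__3__20221012T1842Z \
  -ingestionRate=500 \
  -retentionHours=168
```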
n
@Mamlesh looks like this command is using a deprecated help annotation and that is why it is not printing the usage. Here it is for your convenience:
```
Usage: <main class> RealtimeProvisioningHelper [-h]
       -ingestionRate=<_ingestionRate>
       [-maxUsableHostMemory=<_maxUsableHostMemory>] [-numHosts=<_numHosts>]
       [-numHours=<_numHours>] -numPartitions=<_numPartitions>
       [-numRows=<_numRows>] [-pushFrequency=<_pushFrequency>]
       [-retentionHours=<_retentionHours>]
       [-sampleCompletedSegmentDir=<_sampleCompletedSegmentDir>]
       [-schemaWithMetadataFile=<_schemaWithMetadataFile>]
       -tableConfigFile=<_tableConfigFile>
  -h, --h, -help, --help
      -ingestionRate=<_ingestionRate>
                            Avg number of messages per second ingested on any
                              one stream partition (assumed all partitions are
                              uniform)
      -maxUsableHostMemory=<_maxUsableHostMemory>
                            Maximum memory per host that can be used for pinot
                              data (e.g. 250G, 100M). Default 48g
      -numHosts=<_numHosts> number of hosts as comma separated values (default
                              2,4,6,8,10,12,14,16)
      -numHours=<_numHours> number of hours to consume as comma separated
                              values (default 2,3,4,5,6,7,8,9,10,11,12)
      -numPartitions=<_numPartitions>
                            number of stream partitions for the table
      -numRows=<_numRows>   Number of rows to be generated based on schema with
                              metadata file
      -pushFrequency=<_pushFrequency>
                            Frequency with which offline table pushes happen,
                              if this is a hybrid table
                              (hourly,daily,weekly,monthly). Do not specify if
                              realtime-only table
      -retentionHours=<_retentionHours>
                            Number of recent hours queried most often
                              (-pushFrequency is ignored)
      -sampleCompletedSegmentDir=<_sampleCompletedSegmentDir>
                            Consume from the topic for n hours and provide the
                              path of the segment dir after it completes
      -schemaWithMetadataFile=<_schemaWithMetadataFile>
                            Schema file with extra information on each column
                              describing characteristics of data
      -tableConfigFile=<_tableConfigFile>
```
I will fix in the code shortly.
Are you trying to estimate capacity for a realtime table?
m
yes @Navina
I haven't read it, so not sure if it is aligned with the code 😛
But just looking at your command: for a realtime table, the value for `pushFrequency` is invalid. It is getting ignored because `retentionHours` is provided in the command.
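Per that advice, a sketch of the corrected invocation for a realtime-only table: drop `-pushFrequency` entirely (it only applies to hybrid tables) and keep `-retentionHours`. All paths and values are carried over from the earlier command:

```shell
sh pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile=/home/mamlesh/pinot/ingestData/massRealtimeTable.json \
  -numPartitions=1 \
  -numHosts=1,2,3 \
  -numHours=1,2,3,4,5,6,7,8,9,10,11,12 \
  -sampleCompletedSegmentDir=/disk1/pinotData/server/index/MassDataTableIR_REALTIME/MassDataTableIR__8__3__20221012T1842Z \
  -ingestionRate=500 \
  -maxUsableHostMemory=18G \
  -retentionHours=168
```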
m
Why am I getting these results? Any idea?

```
============================================================
RealtimeProvisioningHelper -tableConfigFile /home/mamlesh/pinot/ingestData/massRealtimeTable.json -numPartitions 10 -pushFrequency null -numHosts 2,3,4,5 -numHours 2 -sampleCompletedSegmentDir /disk1/pinotData/server/index/MassDataTableIR_REALTIME/MassDataTableIR__0__0__20221012T1824Z -ingestionRate 1000 -maxUsableHostMemory 15G -retentionHours 24

Note:
  * Table retention and push frequency ignored for determining retentionHours since it is specified in command
  * See https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime

Memory used per host (Active/Mapped)
numHosts --> 2      |3      |4      |5      |
numHours
 2 --------> NA     |NA     |NA     |NA     |

Optimal segment size
numHosts --> 2      |3      |4      |5      |
numHours
 2 --------> NA     |NA     |NA     |NA     |

Consuming memory
numHosts --> 2      |3      |4      |5      |
numHours
 2 --------> NA     |NA     |NA     |NA     |

Total number of segments queried per host (for all partitions)
numHosts --> 2      |3      |4      |5      |
numHours
 2 --------> NA     |NA     |NA     |NA     |
```
n
Not sure, let me read the code. Meanwhile, I found some usage notes in the code that are also not getting printed:
```
Usage: RealtimeProvisioningHelperCommand

This command allows you to estimate the capacity needed for provisioning realtime hosts.
It assumes that there is no upper limit to the amount of memory you can mmap.
If you have a hybrid table, then consult the push frequency setting in your offline table and specify it in the -pushFrequency argument.
If you have a realtime-only table, then the default behavior is to assume that your queries need all data in memory all the time.
However, if most of your queries are going to be for (say) the last 96 hours, then you can specify that in -retentionHours.
Doing so will let this program assume that you are willing to take a page hit when querying older data,
and optimize memory and number of hosts accordingly.
```
m
Hi, is `-maxUsableHostMemory` volatile memory (RAM) or non-volatile memory (secondary disk)?
n
I think it uses direct memory allocation and hence native memory, not disk. @Mayank / @Neha Pawar to confirm.
m
Hi, about `ingestionRate`: "Specify the average number of rows ingested per second per partition of your stream."
1. "stream" means Kafka, since I'm using Kafka, right?
2. total records per second = numPartitions x ingestionRate, right?
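Assuming uniform partitions (which the tool itself assumes), the aggregate rate is just that product. A quick sketch with the values from the run above:

```shell
# Aggregate ingestion rate across the whole topic, assuming every
# partition ingests at the same average rate (the tool's own assumption).
numPartitions=10
ingestionRate=1000                       # rows/sec per partition
echo $((numPartitions * ingestionRate))  # total rows/sec across the topic
```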
Hi @Navina, did you get any answer about `-maxUsableHostMemory`?
n
@Sajjad Moradi ^^
s
That's memory, not disk. If you get NA, it's usually because `maxUsableHostMemory` is not enough. I think we should remove this parameter: the tool outputs the required value, and the user can see whether it exceeds the actual memory size. For now, you can set a very high number for that parameter.
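Following that workaround, a sketch of the rerun with an intentionally huge `-maxUsableHostMemory` so the tool prints numbers instead of NA (the 1000G figure is an arbitrary stand-in, not a recommendation; other values are carried over from the run above):

```shell
sh pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile=/home/mamlesh/pinot/ingestData/massRealtimeTable.json \
  -numPartitions=10 \
  -numHosts=2,3,4,5 \
  -numHours=2 \
  -sampleCompletedSegmentDir=/disk1/pinotData/server/index/MassDataTableIR_REALTIME/MassDataTableIR__0__0__20221012T1824Z \
  -ingestionRate=1000 \
  -maxUsableHostMemory=1000G \
  -retentionHours=24
```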
m
Then, in the output, what's the difference between "Memory used per host" and "Consuming memory"? As per my understanding, "Memory used per host" is for completed segments, which are saved on disk, so it indirectly indicates the amount of disk required. And "Consuming memory" is RAM usage while a segment is in consuming state. Please correct me if there is any gap in my understanding.
Hi @Sajjad Moradi, please clear up my doubts. Is there any gap in my understanding?
s
Consuming segments need memory (RAM), but completed segments use mmap. So completed segments can get paged into memory from disk when queries are routed to them. That means the memory requirement for consuming segments is a must-have, but for completed segments, because of mmap and automatic page-in, the memory requirement is loose. The more memory (RAM) you allocate, the less page-in you'll have, which makes query execution faster.