Mamlesh
10/17/2022, 4:35 AM
Navina
10/17/2022, 5:30 AM
Usage: <main class> RealtimeProvisioningHelper [-h]
       -ingestionRate=<_ingestionRate>
       [-maxUsableHostMemory=<_maxUsableHostMemory>] [-numHosts=<_numHosts>]
       [-numHours=<_numHours>] -numPartitions=<_numPartitions>
       [-numRows=<_numRows>] [-pushFrequency=<_pushFrequency>]
       [-retentionHours=<_retentionHours>]
       [-sampleCompletedSegmentDir=<_sampleCompletedSegmentDir>]
       [-schemaWithMetadataFile=<_schemaWithMetadataFile>]
       -tableConfigFile=<_tableConfigFile>
  -h, --h, -help, --help
  -ingestionRate=<_ingestionRate>
        Avg number of messages per second ingested on any one stream
        partition (assumed all partitions are uniform)
  -maxUsableHostMemory=<_maxUsableHostMemory>
        Maximum memory per host that can be used for pinot data
        (e.g. 250G, 100M). Default 48g
  -numHosts=<_numHosts>
        number of hosts as comma separated values
        (default 2,4,6,8,10,12,14,16)
  -numHours=<_numHours>
        number of hours to consume as comma separated values
        (default 2,3,4,5,6,7,8,9,10,11,12)
  -numPartitions=<_numPartitions>
        number of stream partitions for the table
  -numRows=<_numRows>
        Number of rows to be generated based on schema with metadata file
  -pushFrequency=<_pushFrequency>
        Frequency with which offline table pushes happen, if this is a
        hybrid table (hourly,daily,weekly,monthly). Do not specify if
        realtime-only table
  -retentionHours=<_retentionHours>
        Number of recent hours queried most often (-pushFrequency is ignored)
  -sampleCompletedSegmentDir=<_sampleCompletedSegmentDir>
        Consume from the topic for n hours and provide the path of the
        segment dir after it completes
  -schemaWithMetadataFile=<_schemaWithMetadataFile>
        Schema file with extra information on each column describing
        characteristics of data
  -tableConfigFile=<_tableConfigFile>
I will fix in the code shortly.
Navina
10/17/2022, 5:30 AM
Mamlesh
10/17/2022, 5:31 AM
Navina
10/17/2022, 5:31 AM
Navina
10/17/2022, 5:31 AM
Navina
10/17/2022, 5:32 AM
pushFrequency is invalid. It is getting ignored because retentionHours is provided in the command.
Mamlesh
10/17/2022, 6:06 AM
Navina
10/17/2022, 6:10 AM
Usage: RealtimeProvisioningHelperCommand
This command allows you to estimate the capacity needed for provisioning realtime hosts. It assumes that there is no upper limit to the amount of memory you can mmap.
If you have a hybrid table, then consult the push frequency setting in your offline table and specify it in the -pushFrequency argument.
If you have a realtime-only table, then the default behavior is to assume that your queries need all data in memory all the time.
However, if most of your queries are going to be for (say) the last 96 hours, then you can specify that in -retentionHours.
Doing so will let this program assume that you are willing to take a page hit when querying older data
and optimize memory and number of hosts accordingly.
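To make the rules above concrete, here is a minimal Python sketch of which setting ends up deciding how much data is assumed to stay in memory, plus the aggregate ingestion rate implied by the per-partition -ingestionRate definition in the help text. This is not Pinot's actual code: the function names `effective_window` and `total_ingestion_rate` are hypothetical, and the only behavior modeled is what the help text and the messages above state (-retentionHours wins over -pushFrequency, and with neither set all data is assumed to be in memory).

```python
from typing import Optional, Tuple, Union

# Values listed in the help text for -pushFrequency.
VALID_PUSH_FREQUENCIES = {"hourly", "daily", "weekly", "monthly"}


def effective_window(
    push_frequency: Optional[str],
    retention_hours: Optional[int],
) -> Tuple[str, Union[int, str, None]]:
    """Return (deciding setting, its value). Hypothetical helper, not Pinot code."""
    if retention_hours is not None:
        # Help text: "-pushFrequency is ignored" when -retentionHours is given,
        # so even an invalid pushFrequency value is not looked at here.
        return ("retentionHours", retention_hours)
    if push_frequency is not None:
        if push_frequency not in VALID_PUSH_FREQUENCIES:
            raise ValueError(f"invalid pushFrequency: {push_frequency!r}")
        # Hybrid table: the offline push frequency drives the estimate.
        return ("pushFrequency", push_frequency)
    # Realtime-only table with no hint: all data is assumed to be in memory.
    return ("allData", None)


def total_ingestion_rate(rate_per_partition: float, num_partitions: int) -> float:
    """Messages per second across the whole stream, assuming uniform partitions."""
    return rate_per_partition * num_partitions


if __name__ == "__main__":
    print(effective_window("daily", 96))
    print(effective_window("daily", None))
    print(effective_window(None, None))
    print(total_ingestion_rate(100, 4))
```

Checking -retentionHours first mirrors the behavior reported earlier in the thread, where an invalid pushFrequency value was silently ignored because retentionHours was also passed.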
Mamlesh
10/17/2022, 6:17 AM
Navina
10/17/2022, 6:28 AM
Mamlesh
10/17/2022, 7:18 AM
ingestionRate: Specify the average number of rows ingested per second per partition of your stream.
1. "Stream" means Kafka here, since I'm using Kafka, right?
2. Total records per second = numPartitions x ingestionRate, right?
Mamlesh
10/19/2022, 6:00 PM
Neha Pawar
Sajjad Moradi
10/20/2022, 4:36 AM
Mamlesh
10/20/2022, 4:48 AM
Mamlesh
10/20/2022, 9:39 AM
Sajjad Moradi
10/21/2022, 5:02 PM