# general
r
Yes, we are trying to benchmark the Presto-Pinot combination on 10GB to 200GB of TPC-H data. We want good query performance with fast loading capability. Compared with other OLAP databases, Pinot seems to take a long time to load data. One observation is that the standalone job uses a single CPU (not sure how many threads) for a single upload job, even when there are multiple files in the import folder. Other OLAP databases seem to load data using more than one CPU. Is there any setting to make the Pinot import job use more than one CPU? Using HDFS, S3, or GCS is not in the scope of the benchmark; we want to minimize dependencies on Hadoop and other big systems because the data sizes we are targeting are not truly Big Data territory.
k
Hi, you can either use the Spark or Hadoop runner, or increase the push job parallelism in the ingestion spec.
r
Can you please share the property that can be used to increase the parallelism? I am using the following spec:

executionFrameworkSpec:
    name: 'standalone'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/home/dbuser/DBGEN/100G/ldatadir/'
includeFileNamePattern: 'glob:**/*'
outputDirURI: '/home/dbuser/DBGEN/100G/lineitemsegments/'
overwriteOutput: true
pinotFSSpecs:
    - scheme: file
      className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
    dataFormat: 'csv'
    className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
    configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
    configs:
        header: 'L_ORDERKEY|L_PARTKEY|L_SUPPKEY|L_LINENUMBER|L_QUANTITY|L_EXTENDEDPRICE|L_DISCOUNT|L_TAX|L_RETURNFLAG|L_LINESTATUS|L_SHIPDATE|L_COMMITDATE|L_RECEIPTDATE|L_SHIPINSTRUCT|L_SHIPMODE|L_COMMENT'
        delimiter: '|'
tableSpec:
    tableName: 'Lineitem'
pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'
k
pushJobSpec:
    pushParallelism: 1
    pushAttempts: 1
    pushRetryIntervalMillis: 1000
You can append this at the end of the spec and increase the parallelism.
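For example, the tail of the spec above might then look like this; the value 4 is illustrative, so tune it to how many segments you want pushed concurrently:

pinotClusterSpecs:
    - controllerURI: 'http://localhost:9000'
pushJobSpec:
    # illustrative value: number of segments pushed to the controller in parallel
    pushParallelism: 4
    pushAttempts: 1
    pushRetryIntervalMillis: 1000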
If you want to speed up the overall job, a better way would be to use the Spark or Hadoop runner. You can check out our docs: https://docs.pinot.apache.org/basics/data-import/batch-ingestion/spark This how-to guide will also be helpful: https://docs.pinot.apache.org/users/tutorials/ingest-parquet-files-from-s3-using-spark
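If you try the Spark route, the executionFrameworkSpec section is what changes. A minimal sketch based on the linked docs (the stagingDir value here is a placeholder; it must be a path every Spark worker can reach):

executionFrameworkSpec:
    name: 'spark'
    segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
    segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
    segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
    extraConfigs:
        # placeholder: a staging directory visible to all Spark workers
        stagingDir: '/home/dbuser/DBGEN/100G/staging/'

The rest of the spec (input/output URIs, record reader, table spec) stays the same; the job is then submitted via spark-submit instead of the standalone launcher, and input files are processed in parallel across Spark tasks.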
r
Thanks, I will try the suggested configuration.