Akash
05/07/2021, 9:14 PM
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: 'hdfs://hadoop/tmp/pinot_staging/'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://hadoop/hp/input/Event1/dateid=2020-12-30/'
outputDirURI: 'hdfs://hadoop/pinot/output/Event1/dateid=2020-12-30/'
This generates the segments under pinot/output/Event1/dateid=2020-12-30/.
I have Pinot deep storage on HDFS, where the controller data lives at:
/hp/pinot/data/controller/Event1/
Currently, AFAIU, the data is moved HDFS => Pinot Controller => HDFS. Is there a way to short-circuit this whole network round trip?
I can see there is a configuration in the table config where we can specify batchIngestionConfig => segmentIngestionType as REFRESH. However, there is no example anywhere; do we have any test in the codebase, or a blog/docs, etc.?
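[Editor's note: a minimal sketch of what such a table config fragment could look like, based on Pinot's `ingestionConfig` section; the table name and the DAILY frequency are illustrative assumptions, not from the thread.]

```json
{
  "tableName": "Event1",
  "tableType": "OFFLINE",
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "REFRESH",
      "segmentIngestionFrequency": "DAILY"
    }
  }
}
```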
Mayank
RK
05/08/2021, 6:10 AM