# general
a
Segment Loading Question: Currently I am loading data into Pinot via Spark job with following config:
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
  extraConfigs:
    stagingDir: 'hdfs://hadoop/tmp/pinot_staging/'
jobType: SegmentCreationAndTarPush
inputDirURI: 'hdfs://hadoop/hp/input/Event1/dateid=2020-12-30/'
outputDirURI: 'hdfs://hadoop/pinot/output/Event1/dateid=2020-12-30/'
This generates the segments under /pinot/output/Event1/dateid=2020-12-30/. I have Pinot deep storage on HDFS, where the controller data lives at:
/hp/pinot/data/controller/Event1/
Currently, AFAIU, the data is moved HDFS => Pinot controller => HDFS. Is there a way to short-circuit that whole network round trip? I can see there is a table configuration where we can specify batchIngestionConfig => segmentIngestionType as REFRESH, though there is no example anywhere. Do we have any test in the codebase, or a blog/docs, etc.?
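For reference, the segmentIngestionType setting mentioned above lives under ingestionConfig.batchIngestionConfig in the offline table config. A hedged sketch, assuming a daily-refreshed offline table; field names follow Pinot's batch ingestion docs, and the table name/frequency here are illustrative:

```json
{
  "tableName": "Event1",
  "tableType": "OFFLINE",
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "REFRESH",
      "segmentIngestionFrequency": "DAILY"
    }
  }
}
```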
m
Yes, absolutely. You can just push the URI to the controller.
Let me find the doc
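To illustrate the URI-push approach: switching the jobType from tar push to URI push means the push client only sends the segment's HDFS URI to the controller, and the controller fetches the segment from HDFS itself, so the segment bytes never stream through the push job. A hedged sketch of the same job spec adjusted for URI push; the controller hostname and pinotFSSpecs values are assumptions, not taken from the original message:

```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: 'hdfs://hadoop/tmp/pinot_staging/'
# URI push instead of tar push: only the segment location is sent to the controller
jobType: SegmentCreationAndUriPush
inputDirURI: 'hdfs://hadoop/hp/input/Event1/dateid=2020-12-30/'
outputDirURI: 'hdfs://hadoop/pinot/output/Event1/dateid=2020-12-30/'
# The controller needs an hdfs:// filesystem plugin to fetch the segment itself
pinotFSSpecs:
  - scheme: hdfs
    className: org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'   # assumed controller address
tableSpec:
  tableName: 'Event1'
```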
r
@User As you mentioned, you are using HDFS as deep storage. Kindly share the reference docs: how do you use HDFS as deep storage?
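For reference, HDFS deep storage is configured via the pinot-hdfs filesystem plugin in the controller (and server) configs. A hedged sketch based on the Pinot HDFS plugin documentation; the data dir and Hadoop conf path below are illustrative, not taken from this thread:

```properties
# Controller config sketch: HDFS as deep storage (paths are illustrative)
controller.data.dir=hdfs://hadoop/hp/pinot/data/controller
controller.local.temp.dir=/tmp/pinot/controller
pinot.controller.storage.factory.class.hdfs=org.apache.pinot.plugin.filesystem.HadoopPinotFS
pinot.controller.storage.factory.hdfs.hadoop.conf.path=/path/to/hadoop/conf
pinot.controller.segment.fetcher.protocols=file,http,hdfs
pinot.controller.segment.fetcher.hdfs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```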