# general
w
We are working on offline segment ingestion. Currently we are using tar push, but its problem is that the controller needs to get involved in the data path by downloading the segment. Just curious, how does metadata push prevent the controller from getting involved in the data path?
k
With metadata push, you give the controller the URI of where the segment is located. This is used to update Zookeeper state, and (if needed) will trigger a download by the server processes. Which is why, when doing metadata push, you need to have your “deep store” location for segments be a shared file system (S3, HDFS, etc) that all the servers can access.
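For context, metadata push is typically selected in the batch ingestion job spec. A minimal sketch (the bucket names, paths, and table name are placeholders, and exact spec fields may differ across Pinot versions):

```yaml
# Sketch of a batch ingestion job spec using metadata push.
executionFrameworkSpec:
  name: 'standalone'
# SegmentCreationAndMetadataPush builds segments, writes them to deep store,
# and pushes only the segment metadata (plus download URI) to the controller.
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://my-bucket/input/'
# Deep store shared by all servers, so they can download segments directly.
outputDirURI: 's3://my-bucket/segments/'
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://controller:9000'
```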
w
@User Thanks
@User ^^
e
Yep, but in order to get the metadata, SegmentPushUtils.sendSegmentUriAndMetadata downloads the segment, extracts the metadata, and uploads only the metadata.
From the code
/**
   * This method takes a map of segment downloadURI to corresponding tar file path, and push those segments in metadata mode.
   * The steps are:
   * 1. Download segment from tar file path;
   * 2. Untar segment metadata and creation meta files from the tar file to a segment metadata directory;
   * 3. Tar this segment metadata directory into a tar file
   * 4. Generate a POST request with segmentDownloadURI in header to push tar file to Pinot controller.
   *
   * @param spec is the segment generation job spec
   * @param fileSystem is the PinotFs used to copy segment tar file
   * @param segmentUriToTarPathMap contains the map of segment DownloadURI to segment tar file path
   * @throws Exception
   */
At least in Pinot 0.8.0 ^^^. Did it change in a newer version of Pinot?
w
Is SegmentPushUtils.sendSegmentUriAndMetadata called outside the controller?
e
Looks like it's called from the ingestion jobs
w
segment download here is not happening in the controller.
e
Only place I see it called is from ingestion jobs
But the upload segment call happens on the controller: PinotSegmentUploadDownloadRestletResource
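For reference, the upload that resource handles looks roughly like this on the wire. This is only a sketch: the header names follow Pinot's FileUploadDownloadClient conventions, but the URIs and payload here are placeholders.

```
POST /v2/segments HTTP/1.1
Host: controller:9000
UPLOAD_TYPE: METADATA
DOWNLOAD_URI: s3://my-bucket/segments/mySegment.tar.gz
Content-Type: multipart/form-data; boundary=...

<small tarball containing only metadata.properties and creation.meta>
```

Because UPLOAD_TYPE is METADATA, the controller updates Zookeeper using the attached metadata and records DOWNLOAD_URI as the location servers fetch the segment from, so the full segment bytes never pass through the controller.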
k
1. The metadata extraction & “push” to the controller happens via ingestion jobs (standalone, Spark, Hadoop).
2. Yes, currently the entire segment is downloaded to wherever the ingestion job is running, and the two required files are extracted/turned into a tarball and then pushed to the controller.
3. I have a PR (about to submit) that does a streaming extract of the metadata files (about 20K), so the entire file doesn’t have to be downloaded.
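The PR itself isn't quoted here, so the following is just a self-contained illustration of the streaming-extract idea using plain java.io (the real implementation presumably uses a proper tar library; the entry names mirror Pinot's metadata.properties and creation.meta, everything else is made up). The point is that tar is a sequential format, so you can read headers one 512-byte block at a time, keep only the small metadata entries, and stop as soon as you have them, without buffering the large data files:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StreamingTarExtract {
    static final int BLOCK = 512;

    // Build a minimal tar archive in memory: per entry, a 512-byte header
    // (name at offset 0, octal size at offset 124), data padded to 512 bytes,
    // then a two-block end-of-archive marker.
    static byte[] makeTar(Map<String, byte[]> entries) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            byte[] header = new byte[BLOCK];
            byte[] name = e.getKey().getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(name, 0, header, 0, name.length);
            byte[] size = String.format("%011o", e.getValue().length)
                    .getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(size, 0, header, 124, size.length);
            out.write(header, 0, BLOCK);
            out.write(e.getValue(), 0, e.getValue().length);
            int pad = (BLOCK - e.getValue().length % BLOCK) % BLOCK;
            out.write(new byte[pad], 0, pad);
        }
        out.write(new byte[2 * BLOCK], 0, 2 * BLOCK);
        return out.toByteArray();
    }

    // Stream through the tar, keeping only the wanted entries.
    // Stops early once everything wanted has been found, so trailing
    // (large) entries are never read. A robust version would loop on skip().
    static Map<String, byte[]> extractWanted(InputStream in, Set<String> wanted)
            throws IOException {
        Map<String, byte[]> found = new LinkedHashMap<>();
        byte[] header = new byte[BLOCK];
        while (found.size() < wanted.size()
                && in.readNBytes(header, 0, BLOCK) == BLOCK
                && header[0] != 0) { // all-zero block = end of archive
            String name = new String(header, 0, 100, StandardCharsets.US_ASCII).trim();
            long size = Long.parseLong(
                    new String(header, 124, 11, StandardCharsets.US_ASCII).trim(), 8);
            long padded = (size + BLOCK - 1) / BLOCK * BLOCK;
            if (wanted.contains(name)) {
                found.put(name, in.readNBytes((int) size));
                in.skip(padded - size);
            } else {
                in.skip(padded); // skip the data without buffering it
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        entries.put("metadata.properties", "segment.name=mySegment".getBytes());
        entries.put("creation.meta", new byte[]{1, 2, 3});
        entries.put("columns.psf", new byte[4096]); // large data file, never read
        byte[] tar = makeTar(entries);
        Map<String, byte[]> meta = extractWanted(new ByteArrayInputStream(tar),
                Set.of("metadata.properties", "creation.meta"));
        System.out.println(new String(meta.get("metadata.properties")));
    }
}
```

With an HTTP range-capable or streaming source, the same loop means only the first ~20K of the remote tarball ever crosses the network.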
e
Makes sense, thanks @User!