#general

coco

05/12/2022, 8:32 AM
I have created a batch pipeline that loads data files from a Cloudera Impala Parquet table into a Pinot cluster. How can I gracefully swap segments if the number of input files gets smaller? I'm using the segment name generator spec described here: https://docs.pinot.apache.org/configuration-reference/job-specification#segment-name-generator-spec
segment.name.prefix : normalizedDate
exclude.sequence.id : false
-- input data file
<hdfs://data/pinot_poc/input_table/yyyymmdd=20220512/data-file-0.parq>
<hdfs://data/pinot_poc/input_table/yyyymmdd=20220512/data-file-1.parq>
<hdfs://data/pinot_poc/input_table/yyyymmdd=20220512/data-file-2.parq>
-- pinot segment
<hdfs://data/pinot_poc/controller/segments/pinot_table/batch_2022-05-12_2022-05-12_0>
<hdfs://data/pinot_poc/controller/segments/pinot_table/batch_2022-05-12_2022-05-12_1>
<hdfs://data/pinot_poc/controller/segments/pinot_table/batch_2022-05-12_2022-05-12_2>
If I redo the batch and the number of data files is reduced to two:
-- input data file
<hdfs://data/pinot_poc/input_table/yyyymmdd=20220512/data-file-0.parq>
<hdfs://data/pinot_poc/input_table/yyyymmdd=20220512/data-file-1.parq>
segment.name: fixed
If I have to use the 'segment.name:fixed' setting, how can I gracefully delete the segment 'batch_2022-05-12_2022-05-12_2'?

Ken Krugler

05/12/2022, 5:35 PM
It’s not atomic, but you could push the two new segments, and delete the third segment using the UI or REST API. Or for a real hack, create a segment with no data that has the same name as the third segment.
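A rough sketch of the delete-via-REST option mentioned above. It assumes the controller exposes `DELETE /segments/{tableName}/{segmentName}` and is reachable at `http://localhost:9000`; both the address and the helper names are placeholders, so check the controller's Swagger UI before relying on the path.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

CONTROLLER = "http://localhost:9000"  # hypothetical controller address


def delete_segment_request(table: str, segment: str, controller: str = CONTROLLER) -> Request:
    """Build (but do not send) the DELETE request for one segment."""
    url = f"{controller}/segments/{quote(table, safe='')}/{quote(segment, safe='')}"
    return Request(url, method="DELETE")


def delete_segment(table: str, segment: str, controller: str = CONTROLLER) -> int:
    """Send the request and return the HTTP status code (needs a live controller)."""
    with urlopen(delete_segment_request(table, segment, controller)) as resp:
        return resp.status


# Inspect the request for the leftover third segment without sending it.
req = delete_segment_request("pinot_table", "batch_2022-05-12_2022-05-12_2")
print(req.get_method(), req.full_url)
```

As noted in the message, this is not atomic: queries can see the stale segment until the delete completes.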

Neha Pawar

05/12/2022, 9:18 PM
If you can generate unique names every time (probably with a prefix), you may be able to use the startSegmentReplace / endSegmentReplace constructs. @Seunghyun is this somewhere we can use that?
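A sketch of what that replace flow might look like from the batch job's side. It assumes the constructs mentioned correspond to the controller's `startReplaceSegments` and `endReplaceSegments` endpoints, that the start call takes a JSON body with `segmentsFrom` and `segmentsTo`, and that the end call takes a `segmentLineageEntryId` query parameter. Treat all of these, plus the `run2_` prefix and the controller address, as assumptions to verify against your Pinot version.

```python
import json
from urllib.parse import urlencode


def start_replace_request(controller, table, segments_from, segments_to, table_type="OFFLINE"):
    """Return (url, json_body) for the assumed startReplaceSegments call."""
    query = urlencode({"type": table_type})
    url = f"{controller}/segments/{table}/startReplaceSegments?{query}"
    body = json.dumps({"segmentsFrom": segments_from, "segmentsTo": segments_to})
    return url, body


def end_replace_request(controller, table, lineage_entry_id, table_type="OFFLINE"):
    """Return the url for the assumed endReplaceSegments call."""
    query = urlencode({"type": table_type, "segmentLineageEntryId": lineage_entry_id})
    return f"{controller}/segments/{table}/endReplaceSegments?{query}"


old = [f"batch_2022-05-12_2022-05-12_{i}" for i in range(3)]
new = [f"run2_batch_2022-05-12_2022-05-12_{i}" for i in range(2)]  # hypothetical unique prefix
url, body = start_replace_request("http://localhost:9000", "pinot_table", old, new)
print(url)
print(body)
```

The intended order would be: call start, push the new segments, then call end with the id returned by the start call, so that queries switch from the old set to the new set together.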

coco

05/16/2022, 12:40 PM
@Ken Krugler @Neha Pawar My problem is that I can't predict the number of input files in advance. I've also considered deleting segments with the REST API. To do this, the batch application must follow the steps below:
1. Count the number of input files.
2. Count the number of previous segments in Pinot.
3. Find the names of the previous segments that will not be overwritten.
4. Perform the Pinot ingestion.
5. Delete the previous segments left in Pinot.
I'm a little worried that this kind of logic can cause problems. Also, between 3 and 4, the data is temporarily inaccurate. When creating a segment, is there a way to create a single segment like 'batch_2022-05-12_2022-05-12'? Or is there a way to create a fixed number of segments?
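Steps 1 to 3 above reduce to a set difference, which can be sketched as follows. The sketch assumes segment names follow the `<prefix>_<date>_<date>_<sequenceId>` pattern shown earlier in the thread; the function names are made up for illustration, and the list of existing segments would come from the controller.

```python
def expected_segment_names(prefix: str, day: str, num_files: int) -> list[str]:
    """Names the new run will produce, one per input file, sequence ids from 0."""
    return [f"{prefix}_{day}_{day}_{i}" for i in range(num_files)]


def stale_segments(existing: list[str], prefix: str, day: str, num_files: int) -> list[str]:
    """Existing segments for this day that the new run will NOT overwrite."""
    keep = set(expected_segment_names(prefix, day, num_files))
    day_prefix = f"{prefix}_{day}_{day}_"
    return sorted(s for s in existing if s.startswith(day_prefix) and s not in keep)


existing = [
    "batch_2022-05-12_2022-05-12_0",
    "batch_2022-05-12_2022-05-12_1",
    "batch_2022-05-12_2022-05-12_2",
]
# The rerun has only two input files, so the third segment is left over.
print(stale_segments(existing, "batch", "2022-05-12", 2))
```

Filtering by the day prefix keeps the job from touching segments that belong to other dates.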

Ken Krugler

05/16/2022, 4:03 PM
@coco with batch segment generation, at least when generating from CSV files, you get one segment per input file. I added support a while back for naming segments based on the input filename, so that you could control the final name based on what your upstream job generated for input data.
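To illustrate the idea of filename-based naming (this is not Pinot's implementation), here is a small sketch that derives a segment name from an input path with a regular expression. The pattern, the `batch` prefix, and the resulting name format are all hypothetical.

```python
import re

# Hypothetical pattern: capture the partition date and the file stem.
FILE_PATTERN = re.compile(r".+/yyyymmdd=(\d{8})/(.+)\.parq$")


def segment_name_for(path: str, prefix: str = "batch") -> str:
    """Derive a deterministic segment name from an input file path."""
    m = FILE_PATTERN.match(path)
    if m is None:
        raise ValueError(f"unexpected input path: {path}")
    day, stem = m.groups()
    return f"{prefix}_{day}_{stem}"


print(segment_name_for("hdfs://data/pinot_poc/input_table/yyyymmdd=20220512/data-file-0.parq"))
```

With names tied to file names, a rerun overwrites a segment only when the upstream job emits the same file name again, so a rerun with fewer files would still leave the old extra segment in place.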