Hello Team, i wanted to understand one thing does ...
# general
p
Hello Team, I wanted to understand one thing: does Pinot provide the capability to overwrite segment data (like we overwrite partitioned data in a Hive table)?
m
Yes, you can overwrite a segment simply by pushing it again.
The contents of the old segment are overwritten by the contents of the new segment. Segments are matched based on their name.
p
I don't understand, let me explain my use case. I have date-partitioned Parquet data in S3. Let's say I load the 10th June data on 10th June and then push the 10th June data again on 11th June. I want Pinot to overwrite the data and keep only the latest.
m
So when you generate segments, there's a configurable segment naming convention that defaults to (table_name_minTime_maxTime_num).
For example: myTable_2021-06-10_2021-06-10_0
When you regenerate the June 10th data on June 11th, the segment will still be named ^^. And when you push it to Pinot, it will overwrite the previously pushed segment of the same name.
It is segment name based, not really time or partition based.
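The name-based overwrite described above can be pictured with a small sketch. It only mimics the `table_name_minTime_maxTime_num` pattern from this thread; the helper `default_segment_name` and the dict standing in for the table's segment list are hypothetical and are not Pinot code.

```python
def default_segment_name(table_name, min_time, max_time, sequence_id):
    """Build a name following the table_name_minTime_maxTime_num pattern."""
    return f"{table_name}_{min_time}_{max_time}_{sequence_id}"


def push(store, name, payload):
    """Pushing under an existing name replaces the earlier entry."""
    store[name] = payload


# A dict keyed by segment name stands in for the table's segments
# (an assumption for illustration only).
segments = {}

name_june10 = default_segment_name("myTable", "2021-06-10", "2021-06-10", 0)
push(segments, name_june10, "rows generated on June 10")

# Regenerating the same day's data on June 11 yields the same name,
# so the push replaces the earlier segment instead of adding a new one.
name_rerun = default_segment_name("myTable", "2021-06-10", "2021-06-10", 0)
push(segments, name_rerun, "rows regenerated on June 11")
```

Because the key is the name, nothing here looks at the date or partition of the data itself.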
p
Can you share a reference where I can see how the segment name is created?
m
this is the default
p
@User a few questions:
1. What is the logic for creating the segment name? Is some default postfix attached to it?
2. I kept two files in a partition folder and two segments were created. Then I removed one of the files from the partition folder and ran the job again. There are still two segments: one got replaced and the other (older) one is still there. I was expecting both segments to be deleted and a new one created.
3. The job spec YAML contains both the input and output URIs. If I set up a pipeline to read each day's data, I need to modify the file again and again with some script. Is there an automated way you handle this on your side?
My use case is simple: I am writing data to S3, partitioned by date. Every day when I run the Pinot job, I want all segments for that day to be overwritten.
m
1. Yes, see <https://docs.pinot.apache.org/configuration-reference/job-specification#segment-name-generator-spec>.
2. Pinot uses a CRC check to figure out whether the data has changed. If you regenerate a segment and push it again without changing the input, Pinot will see the same CRC and won't overwrite (as expected).
3. Not sure if I follow, what file do you need to modify again and again? Maybe check <https://docs.pinot.apache.org/basics/components/minion#starting-a-minion>, which can be used to write scheduled tasks.

You can overwrite all segments every day. But note that if your retention is 1 year, do you want to regenerate and push 1 year's worth of segments each day?
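The CRC check from point 2 can be pictured roughly as follows. This is a minimal sketch, assuming the checksum is taken over raw bytes with Python's `zlib.crc32`; how Pinot actually computes and stores the segment CRC is not shown in this thread, and `should_replace` is a hypothetical helper.

```python
import zlib


def crc_of(data: bytes) -> int:
    """Checksum of the input bytes (a stand-in for a segment's CRC)."""
    return zlib.crc32(data)


def should_replace(existing_crc: int, new_data: bytes) -> bool:
    """Replace only when the newly generated data has a different CRC."""
    return existing_crc != crc_of(new_data)


original = b"id,value\n1,2\n"
changed = b"id,value\n1,3\n"  # same length, one byte differs

stored_crc = crc_of(original)
unchanged_push = should_replace(stored_crc, original)  # same input, no overwrite
changed_push = should_replace(stored_crc, changed)     # different input, overwrite
```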
p
Let me explain my scenario with an example. I have two files (data is partitioned by date, i.e. 2021-05-01):
/container_data/examples/rawdata/2021-05-01/raw_data1.csv -- total records : 2
/container_data/examples/rawdata/2021-05-01/raw_data.csv -- total records : 1
The first time I ran the Pinot ingestion job, I could see that two segment files were created:
/container_data/examples/segments/2021-05-01/dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_1.tar.gz ---- total records : 2
/container_data/examples/segments/2021-05-01/dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_0.tar.gz -- total records : 1
Now my input data can change. Let's assume that in the next run I have only a single file for 2021-05-01:
/container_data/examples/rawdata/2021-05-01/raw_data.csv -- total records : 3
When I tried to load this file again, I expected all segments to be overwritten, but that is not happening. Actual: one of the segments was overwritten but the other one remains the same:
/container_data/examples/segments/2021-05-01/dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_1.tar.gz ---- total records : 2 (from previous run)
/container_data/examples/segments/2021-05-01/dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_0.tar.gz -- total records : 3 (newly created)
But I was expecting this:
/container_data/examples/segments/2021-05-01/dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_0.tar.gz -- total records : 3
This is the use case I wanted to test with Pinot.
@User
m
Like I mentioned, Pinot will overwrite the segment that was pushed again. It does not know that you want it to delete the other segment that already exists for that date. You'll have to delete that explicitly, right now.
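Working out which segments need the explicit delete is a set difference between what already exists for the date and what the latest run pushed. A minimal sketch, using the segment names from the example above; `stale_segments` is a hypothetical helper, and how the two lists are obtained is left out:

```python
def stale_segments(existing, pushed):
    """Segments that exist for the date but were not re-pushed in the latest run."""
    return sorted(set(existing) - set(pushed))


# Names taken from the 2021-05-01 example in this thread.
existing = [
    "dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_0",
    "dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_1",
]
pushed = ["dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_0"]

# Only the _1 segment is left over from the previous run.
to_delete = stale_segments(existing, pushed)
```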
p
Okay, so in that case do you have any job that deletes segments?
m
Not at the moment, you can check the minion link I pasted above to see if you can use that. You can also file an issue with this request.
p
One more thing I observed: I deleted segments manually from the output directory folder, but I can still see the data in the query console. Does Pinot keep data in memory?
Do I need to delete the segment somewhere else as well?
m
You need to call the delete API; check Swagger.
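A rough sketch of building such a delete call with the Python standard library. The path `/segments/{tableName}/{segmentName}` and the controller address `http://localhost:9000` are assumptions here and should be confirmed against the Swagger page of your own controller; `build_delete_request` is a hypothetical helper, and the request is only constructed, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_delete_request(controller_url, table_name, segment_name):
    """Build (but do not send) an HTTP DELETE for one segment.

    The endpoint path is an assumption; verify it in the controller's Swagger UI.
    """
    path = f"/segments/{quote(table_name, safe='')}/{quote(segment_name, safe='')}"
    return Request(controller_url.rstrip("/") + path, method="DELETE")


req = build_delete_request(
    "http://localhost:9000",  # assumed controller address
    "dim_meta_eg",
    "dim_meta_eg_OFFLINE_2021-05-01_2021-05-01_1",
)

# To actually send it:
#   from urllib.request import urlopen
#   with urlopen(req) as resp:
#       print(resp.status)
```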
p
I have seen that. When I deleted the segment using the API, it didn't delete the .tar.gz file from the output directory. So what is the use of keeping the tar file in the output directory?
m
This is likely because it is your output directory and not something controlled by Pinot. The directory controlled by Pinot is the dataDir that you specified for the controller.