Hi Team, We have a dimension table and we want to ...
# general
s
Hi Team, we have a dimension table and want to replace its data on a regular basis. We have the following questions:
1. From the documentation we understand that we can give a fixed name to a particular segment, e.g. SEGMENT_NAME. How can we achieve the same with multiple segments, e.g. SEGMENT_NAME_1, SEGMENT_NAME_2?
2. For OFFLINE tables, is there a way to specify the number of records per segment, similar to realtime.segment.flush.threshold.rows for REALTIME tables?
m
For offline tables you would generally create each segment from a fixed-size data source, so you can control the segment size by controlling the number of rows in that data source.
E.g. see this blog post: https://www.markhneedham.com/blog/2022/01/19/apache-pinot-sorted-indexes-offline-tables/ There I import data from two CSV files, and Pinot creates one segment per CSV file.
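Since one input file becomes one segment, a rough way to cap rows per segment is to split the source CSV before ingestion. The sketch below is only an illustration of that idea: the helper names split_rows and split_csv_text are hypothetical, it works on in-memory text rather than real files, and it does not call any Pinot API.

```python
import csv
import io
from typing import Iterator, List


def split_rows(rows: List[List[str]], rows_per_file: int) -> Iterator[List[List[str]]]:
    """Yield consecutive chunks of at most rows_per_file data rows."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    for start in range(0, len(rows), rows_per_file):
        yield rows[start:start + rows_per_file]


def split_csv_text(csv_text: str, rows_per_file: int) -> List[str]:
    """Split one CSV document into several, repeating the header in each part."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)          # first line is assumed to be the header
    data_rows = list(reader)
    outputs = []
    for chunk in split_rows(data_rows, rows_per_file):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(chunk)
        outputs.append(buffer.getvalue())
    return outputs


if __name__ == "__main__":
    source = "id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n"
    for index, part in enumerate(split_csv_text(source, rows_per_file=2)):
        print(f"--- part {index} ---")
        print(part, end="")
```

Each returned part would be written to its own file in the input folder, so that the ingestion job produces one segment per part.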
m
For 1, I think it will append the count suffix by default if you have multiple segments being generated in the same job
m
Hi @User, for 1, we have created 5 different CSVs and deployed a dimension table, so the segmentIngestionType is "REFRESH". We have defined "Fixed" in the SegmentNameGeneratorSpec for the segment name. When I push these CSVs, I expect 5 segments to be generated following the pattern SEGMENT_1, SEGMENT_2, SEGMENT_3, etc. However, only 1 segment named SEGMENT is created, and it replaces the previously pushed segment.
m
Are the CSVs in the same folder? If not, you'll want to put them in a single folder.
m
Yes the CSVs are in the same folder
m
Can you remove the segment name generator spec setting and try again?
m
If I disable the setting and push the segments I see 5 segments getting created. However, I want the segments to have a user defined fixed name.
Hi @User, is this a bug? Should I go ahead and raise an issue?
m
What is the naming scheme of segments if you disable the setting?
@User
m
If I disable the setting, the naming scheme followed is 'Simple', which is the default scheme. With that, if I have 5 CSV files, I can see 5 different segments getting created.
Okay, I checked the code and this looks like a bug to me:
generateSegmentName function in FixedSegmentNameGenerator.java: https://github.com/apache/pinot/blob/3f93cfb93a386ee3e8bd108ffe34814b734575e9/pino[…]e/pinot/segment/spi/creator/name/FixedSegmentNameGenerator.java
generateSegmentName function in SimpleSegmentNameGenerator.java: https://github.com/apache/pinot/blob/3f93cfb93a386ee3e8bd108ffe34814b734575e9/pino[…]/pinot/segment/spi/creator/name/SimpleSegmentNameGenerator.java
Observations:
1. FixedSegmentNameGenerator does not append the sequenceId to the segment name, so the same segment name is generated every time.
2. SimpleSegmentNameGenerator, apart from the normal convention (generating the segment name from the table name etc.), also appends the sequenceId. That is why the segment names get sequence numbers when more than one CSV file is pushed.
Is this observation right? @User
Raised an issue for the same: https://github.com/apache/pinot/issues/8068
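To make the two observations concrete, here is a simplified Python model of the behaviour described above. It is not the Pinot source: the function names are hypothetical, the simple scheme is reduced to table name plus sequence id, and the push is modeled as a dictionary keyed by segment name, on the assumption that a pushed segment replaces an existing one with the same name.

```python
from typing import Dict, List


def fixed_segment_name(segment_name: str, sequence_id: int) -> str:
    """Model of the fixed scheme as described: the sequence id is ignored."""
    return segment_name


def simple_segment_name(table_name: str, sequence_id: int) -> str:
    """Model of the simple scheme, reduced to table name plus sequence id."""
    return f"{table_name}_{sequence_id}"


def push_segments(names: List[str]) -> Dict[str, int]:
    """Model a push: a segment with an existing name replaces the old one.

    Returns a mapping of segment name to the index of the input file
    whose data ends up stored under that name.
    """
    table: Dict[str, int] = {}
    for file_index, name in enumerate(names):
        table[name] = file_index
    return table


if __name__ == "__main__":
    fixed = [fixed_segment_name("SEGMENT", i) for i in range(5)]
    simple = [simple_segment_name("dimTable", i) for i in range(5)]
    print(push_segments(fixed))   # a single entry survives
    print(push_segments(simple))  # five distinct entries
```

With five input files, the fixed model leaves a single segment holding only the last file's data, which matches the behaviour reported in this thread, while the simple model yields five distinct names.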
m
Can you give me a sample segment name without the setting? Also, do you have a time column in your data? (Refresh tables are not expected to have one.)
@User afaik all refresh use cases have the count suffix segment name. Is that available in OSS too?
m
Segment name: "SEGMENT". Yes, we have a time column in our table.
After removing the time column and going with the Simple SegmentNameGenerator scheme, we get the right segment names: each name is the table name with the sequenceId appended. Thanks for pointing this out.
However, I still see this as an issue: https://github.com/apache/pinot/issues/8068
j
@User yes, that’s also available in OSS. The sequence ID (i.e. the suffix) can be passed into the SegmentGeneratorConfig class: https://github.com/apache/pinot/blob/7cbd5ac21dc0c5c3c4ea007edb1980723c90e393/pino[…]rg/apache/pinot/segment/spi/creator/SegmentGeneratorConfig.java
m
Thanks @User. I guess it was because of refresh + time column
Do you mind looking at the issue above?
j
looking into it now
Ah yes, FixedSegmentNameGenerator won’t respect the sequence id. At LinkedIn we use the NormalizedDateSegmentNameGenerator, @User, which is something you can try.
This is the place where the segment name generator is chosen for LinkedIn use cases: https://github.com/apache/pinot/blob/611f3b11b1336bc9f426ead5ba908a9632a50fd9/pino[…]/org/apache/pinot/hadoop/job/mappers/SegmentCreationMapper.java Both SimpleSegmentNameGenerator and NormalizedDateSegmentNameGenerator will pick up the sequence id correctly.