# general
j
Hi, I have some doubts about partitioning. From the docs I can grasp how partitioning works when importing data through Kafka, but I am not very clear on offline importing. My current understanding and questions are as follows:
1. If my input folder has 10 files, these individual files will be saved as individual segments, following the replication config mentioned in the table config.
2. If I want to partition these files as they are imported, the doc says the input files themselves should already be in a partitioned state. How should this be configured in the table config? Is it similar to the partition config we set for realtime import, i.e. a columnPartitionMap with a partition function and a number of partitions? If so, I am a bit confused, because in offline import the created segments have a 1-to-1 mapping with the input files; if the input files are already partitioned, what is the purpose of the partition config? Is it used for query routing alone?
k
@Rong R
r
The segment partition config is not only used for direct segment file ingestion into an offline table; for example, RealtimeToOfflineSegmentsTask uses this config to move data from a realtime to an offline table.
1. Could you share how you are ingesting these 10 files? Are you directly uploading segments or running some pre-processing?
2. Consider the table config as the "desired" state; it doesn't define how to get to that state. So:
a. On "the input files should be in a partitioned state; how should this be configured in the table config?" --> this is not configured in the table config. The data should either (1) already be partitioned, or (2) be partitioned as part of your conversion job config.
b. Yes, it is similar to realtime ingestion; but since the realtime ingestion consumers (Kafka and Kinesis, for example) are part of Pinot, there's already logic to convert the table config into Kafka/Kinesis config.
c. I think we already answered your last question: the config defines the end state; it can be used for query routing as well as input to a conversion config.
Also, if the data is not already partitioned, you could look at how batch ingestion can be used to import the data: https://docs.pinot.apache.org/basics/data-import/batch-ingestion
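For reference, the partition settings live under tableIndexConfig in the table config. A minimal sketch (the column name, partition function, and partition count below are made-up placeholders; the "partition" pruner under routing is what enables partition-based query pruning):

```json
{
  "tableIndexConfig": {
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "orgId": {
          "functionName": "Murmur",
          "numPartitions": 10
        }
      }
    }
  },
  "routing": {
    "segmentPrunerTypes": ["partition"]
  }
}
```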
j
I am making use of a minion task to ingest these files into Pinot. I have already partitioned the data based on a primary key of the different organisations; consider the 10 files to be the partitioned data of 10 orgs. Now, without the partition config, how will Pinot know that these are partitioned data? And if I want to use partitioned replica-group segment assignment, how will all of this work in the offline scenario?
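For reference, the partitioned replica-group segment assignment I have in mind is the one from the docs, configured roughly like this (a sketch based on my reading of the docs; the column name and the counts are placeholders):

```json
{
  "segmentsConfig": {
    "replication": "2",
    "replicaGroupStrategyConfig": {
      "partitionColumn": "orgId",
      "numInstancesPerPartition": 1
    }
  }
}
```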
r
Ah, I see. Which minion task are you using to ingest? And just so I understand correctly: the data is already partitioned, and the goal here is not to repartition it, because it has already been partitioned, correct?
Tagging @Xiaobing, who has more context on minion tasks.
j
I am going to make use of this minion task - https://docs.pinot.apache.org/basics/components/minion#segmentgenerationandpushtask And yes, my data is already partitioned, so how will Pinot understand this and store the files on the Pinot servers in the appropriate partitions?
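For context, I plan to enable the task in the table config roughly like this (a sketch; the schedule value is just an example):

```json
{
  "task": {
    "taskTypeConfigsMap": {
      "SegmentGenerationAndPushTask": {
        "schedule": "0 */10 * * * ?"
      }
    }
  }
}
```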
👍 1
x
Did a quick check; I think this task may not be able to do partitioning when generating segments. E.g. the util class used by the task does not set the partition config. Potentially, if we pass the partition config down to it, the task might be able to do partitioning while generating segments:
```java
// SegmentGenerationTaskRunner.java
public String run()
    throws Exception {
  // ...
  SegmentGeneratorConfig segmentGeneratorConfig = new SegmentGeneratorConfig(tableConfig, schema);
  segmentGeneratorConfig.setTableName(tableName);
  segmentGeneratorConfig.setOutDir(_taskSpec.getOutputDirectoryPath());
  // ...
  // but it's missing this: segmentGeneratorConfig.setSegmentPartitionConfig(...);
}
```
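A minimal sketch of that potential fix (hypothetical; it assumes the setter noted above exists on SegmentGeneratorConfig and that the partition config can be read off the table's indexing config):

```java
// Hypothetical: pass the table's partition config down so the generated
// segments carry partition metadata. Assumes the setter noted above.
SegmentPartitionConfig partitionConfig =
    tableConfig.getIndexingConfig().getSegmentPartitionConfig();
if (partitionConfig != null) {
  segmentGeneratorConfig.setSegmentPartitionConfig(partitionConfig);
}
```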
We have an in-house minion task that generates offline segments using a more recent util called SegmentProcessorFramework, which is in the OSS repo and is used by the RealtimeToOffline task. That framework supports partitioning. So alternatively, we could extend SegmentGenerationAndPushTask to use SegmentProcessorFramework too.