# getting-started
l
I'm trying to import at least 2 years' worth of data and was looking to see if I could get some guidance on how to go about this. I have been taking a look at the ingestion job framework; is this the way to go? What are some of the considerations we have to make when doing these backfills? I see that data is divided into folders which are the days, and each of these days will be a segment on Pinot, is that right? How do we ensure that the data we are ingesting will still perform well? And what are some tips you could give when moving a lot of data?
x
The general guideline is to pre-partition the data by date; you will then have multiple raw data files per day, and each data file will become one Pinot segment (1:1 mapping).
For ingestion, segment creation and push are external processes, or you can start a set of Pinot minion nodes to do the job.
That way it will not impact your runtime Pinot servers.
For the data push, set the push parallelism to ensure you won't exhaust the Pinot controller.
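Something like this, as a minimal sketch of a standalone ingestion job spec along the lines of the Pinot batch ingestion docs (the table name, paths, input format and parallelism values here are illustrative placeholders, adjust for your setup):

```yaml
# One job spec per pre-partitioned date folder when backfilling.
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'file:///data/myTable/2022-01-01/'        # one date folder of raw files
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 'file:///segments/myTable/2022-01-01/'
overwriteOutput: true
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushParallelism: 2    # keep this modest so the controller isn't overwhelmed
  pushAttempts: 2
```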
l
Right, as explained here: https://docs.pinot.apache.org/users/tutorials/batch-data-ingestion-in-practice and in short, as you said, each of those files will be a segment. How do I know my segment size is okay for each of the files?
Right now we have a hybrid model, and these are our configs for the current segments on the realtime side of it:
```json
"realtime.segment.flush.threshold.rows": "0",
"realtime.segment.flush.threshold.time": "24h",
"realtime.segment.flush.segment.size": "250M"
```
Another question that I had is how these configs:
```json
"ingestionConfig": {
  "batchIngestionConfig": {
    "segmentIngestionType": "APPEND",
    "segmentIngestionFrequency": "HOURLY"
  }
},
```
impact the offline table
m
```
segmentIngestionType - Used for data retention
segmentIngestionFrequency - Used to compute time-boundary for hybrid tables
```
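For example, a rough sketch of where retention lives in the offline table config (the time column name and values here are illustrative):

```json
"segmentsConfig": {
  "timeColumnName": "eventTime",
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "730"
}
```

With `segmentIngestionType: APPEND`, segments older than that retention are periodically purged. With an `HOURLY` ingestion frequency, the time boundary for hybrid queries is (as far as I know) the latest end time of the offline segments minus one hour, so the offline table answers everything up to that boundary and the realtime table answers the rest.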
l
Thank you, Mayank.
Also, to explain our current setup, we have this:
```
realtime table with 7 days retention
offline table with 2 years retention (realtime data is eventually moved here)
```
We want to backfill the offline table with data that is on the system we are moving away from to Pinot. Is this the way people usually do it, or do we usually create another offline table that does backfilling only?
m
You can backfill a hybrid table.
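In practice that usually means running the batch ingestion jobs against the existing offline table (the one with the 2-year retention), one date folder at a time, for example (illustrative path, assuming the standalone spec sketched above):

```bash
# Launch one ingestion job per historical date folder against the existing offline table.
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job-spec-2022-01-01.yaml
```

As far as I understand, a pushed segment with the same name as an existing one replaces it, so you can also safely re-run the backfill for a given day.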