# getting-started
b
Hi - If I have a large set of S3 files in an object store, what is a good way to run query analysis on them? Can Pinot be used to solve this use case? The S3 files aren't fully static... they can change from time to time. Would love to be pointed at something relevant.
I came across this tutorial: https://docs.pinot.apache.org/users/tutorials/ingest-parquet-files-from-s3-using-spark . But I'm wondering: 1) What happens when the data in the S3 files changes? Do we just re-run the Spark jobs on some cadence to re-ingest the data? There will be times when the data is stale, no? 2) In this tutorial, can Spark be replaced with, say, Flink?
m
If the S3 files change, you could manually delete the segments from Pinot and re-ingest the data. Depending on how the data is partitioned, the regenerated segments may end up with the same names, in which case they automatically replace the existing ones when pushed.
At the moment, batch ingestion supports Spark, Hadoop, and standalone execution, so not Flink for now: https://docs.pinot.apache.org/basics/data-import/batch-ingestion
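For reference, batch ingestion is driven by a job spec YAML. Here's a minimal sketch for reading Parquet files from S3 with the standalone runner - the bucket, table name, region, and controller address are placeholders, and for Spark you'd swap the executionFrameworkSpec for the Spark runner classes described in the batch ingestion docs above.
```yaml
# Minimal ingestion job spec sketch (placeholder bucket/table/controller values).
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'

# Create segments from the input files and push them to the controller.
jobType: SegmentCreationAndTarPush

# Where to read the Parquet files and where to write the generated segments.
inputDirURI: 's3://my-bucket/raw/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://my-bucket/pinot-segments/'
overwriteOutput: true

# Tell Pinot how to access S3.
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-west-2'

# Read the input as Parquet.
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'

# Target table and cluster (placeholder controller address).
tableSpec:
  tableName: 'myTable'
  schemaURI: 'http://localhost:9000/tables/myTable/schema'
  tableConfigURI: 'http://localhost:9000/tables/myTable'

pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```
You'd launch it with `pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/job-spec.yaml` (see the batch ingestion docs for details). Re-running the job on a cadence regenerates and pushes segments, and segments pushed with the same name replace the old ones, which is what makes the periodic re-ingest approach work.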