# general
p
Hi, I am just exploring this project and have a question on Pinot S3 data ingestion. At our company, new data arrives as JSON/CSV files every minute/hour. We are currently using Postgres, which is hard to scale, so we are looking for a performant, horizontally scalable OLAP solution, ideally one that runs on Kubernetes. My question is whether it is possible to sync an S3 bucket with Pinot. That is, when we add new CSV/JSON files to the bucket, Pinot should automatically ingest (only) the new files into its segment store, without duplicates. I expect this is doable using S3 events, but I couldn't find anything like this already in place. If not, we would have to cook up our own solution using S3 events, or set up a Kafka cluster to stream data to Pinot. Thanks!
s
You might want to look at the StarTree offering, which does have a sync-mode ingestion built using minions: https://www.startree.ai/blog/no-code-batch-ingestion-for-apache-pinot-in-startree-cloud#:~:text=to%20the%20table.-,S[…]20files,-are%20ingested%20for OSS Pinot doesn't have something like that as of yet. Like you said, you could build something similar using Kafka, or using a custom minion task.
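To make the roll-your-own option concrete: one pattern (an assumption on my part, not something OSS Pinot ships) is to have an S3 event notification trigger a small function (e.g. a Lambda) that calls the Pinot controller's `/ingestFromURI` endpoint for each new file. A minimal sketch of mapping an S3 event record to that request; the controller URL and table name below are placeholders:

```python
from urllib.parse import urlencode

def build_ingest_request(controller_url: str, table_name: str, event_record: dict) -> str:
    """Map one S3 event notification record to a Pinot /ingestFromURI call.

    controller_url and table_name are deployment-specific placeholders.
    /ingestFromURI is the controller's one-file synchronous ingestion API;
    a Lambda (or similar) would POST to the URL this function returns.
    """
    # Pull the bucket and key out of the standard S3 notification shape.
    bucket = event_record["s3"]["bucket"]["name"]
    key = event_record["s3"]["object"]["key"]
    source_uri = f"s3://{bucket}/{key}"
    params = urlencode({
        "tableNameWithType": f"{table_name}_OFFLINE",
        "batchConfigMapStr": '{"inputFormat":"csv"}',  # or "json"
        "sourceURIStr": source_uri,
    })
    return f"{controller_url}/ingestFromURI?{params}"

# Example record following the AWS S3 event notification format.
record = {"s3": {"bucket": {"name": "my-bucket"},
                 "object": {"key": "data/2024/file1.csv"}}}
print(build_ingest_request("http://pinot-controller:9000", "myTable", record))
```

Dedup across retries (S3 events are at-least-once) would still be on you, e.g. by keying off the object path.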
m
If your data is landing every minute, the better pattern here would be stream ingestion. If you can find a way to push your data into Kafka / Kinesis / any streaming system, Pinot can ingest directly from the stream. This will make your data available for serving almost immediately.
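For reference, the realtime table config would carry a `streamConfigs` block along these lines (a sketch: the topic name, broker list, and flush thresholds below are placeholders, not values from this thread):

```json
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "events",
  "stream.kafka.broker.list": "kafka:9092",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
  "realtime.segment.flush.threshold.rows": "1000000",
  "realtime.segment.flush.threshold.time": "6h"
}
```

With that in place, Pinot consumes from the topic continuously and builds segments itself, so the S3-sync/dedup problem goes away.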