# general
k
Question about batch import job. When running a LaunchDataIngestionJob, I see that the S3-based file(s) being ingested are first copied to a temp directory on my local machine. Assuming I’ve set up a k8s-based cluster via EKS, is there a way to ingest directly from S3? I seem to recall some option to do this, which would be much more efficient.
x
The main motivation for that is that Pinot will read the file twice and rewind the input stream. That can cause issues when we have no knowledge about the upstream source, so we always copy the file to local disk first. It’s also questionable whether reading directly from remote storage would actually be more efficient.
We can add an option to allow setting the input file path directly, so you can compare the results.
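For reference, a minimal sketch of a standalone ingestion job spec that reads its input from S3 (the bucket, region, table name, and controller URI below are hypothetical placeholders). Note that, per the behavior described above, Pinot still stages each input file in a local temp directory during segment generation even when inputDirURI points at S3.

```yaml
# Hypothetical ingestion job spec for LaunchDataIngestionJob (standalone execution).
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/raw/'            # placeholder bucket/prefix
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://my-bucket/segments/'      # placeholder output location
overwriteOutput: true
pinotFSSpecs:
  # S3 filesystem plugin; credentials are typically resolved from the environment or IAM role.
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'                     # placeholder region
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'myTable'                        # placeholder table
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'   # placeholder controller endpoint
```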
k
OK, thanks. So in my situation, where I’m running this LaunchDataIngestionJob to pull lots of big files from S3 into a k8s-based cluster running in AWS, what’s currently going to be most efficient? I guess I could spin up another beefy EC2 instance and run the command from that server instead of from my (home office) laptop.
x
Hmm, how many segments do you have? One thing you can try is to start a Pinot ingestion job as a k8s batch job, so you can allocate resources to the container. Here is one example: https://github.com/fx19880617/pinot-meetup-demo/blob/master/covid19/covid19-recovered-global-ingestion-job.yaml#L126
Typically we want to avoid copying S3 data to your local laptop. It's a good idea to have an EC2 instance and run the command from there.
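A minimal sketch of the in-cluster approach, modeled loosely on the linked example. The image tag, resource sizes, and the ConfigMap/volume names are assumptions, and it assumes the apachepinot/pinot image uses pinot-admin.sh as its entrypoint, so the container args are just the pinot-admin subcommand.

```yaml
# Hypothetical Kubernetes batch Job that runs LaunchDataIngestionJob inside the EKS cluster,
# so the data never passes through a laptop outside AWS.
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-s3-ingestion-job
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pinot-ingestion
          image: apachepinot/pinot:latest        # pin a specific tag in practice
          args:
            - LaunchDataIngestionJob
            - -jobSpecFile
            - /var/pinot/jobspec/ingestion-job-spec.yaml
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              memory: 8Gi
          volumeMounts:
            - name: jobspec
              mountPath: /var/pinot/jobspec
      volumes:
        - name: jobspec
          configMap:
            name: pinot-ingestion-job-spec       # ConfigMap holding the ingestion job spec YAML
```

Sizing the container's CPU and memory requests to match the input volume keeps segment generation off the laptop and close to both S3 and the Pinot controller.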
k
Yes, exactly.