Question about the batch import job. When running a LaunchDataIngestionJob, I see that the S3-based file(s) being ingested are first copied to a temp directory on my local machine. Assuming I’ve set up a k8s-based cluster via EKS, is there a way to ingest directly from S3? I seem to recall some option to do this, which would be much more efficient.
11/30/2020, 9:32 PM
The main motivation is that Pinot reads the file twice, which means rewinding the input stream. That could cause issues when we know nothing about the upstream source, so we always copy the file to local storage first. It’s also doubtful that reading directly from remote storage is more efficient.
We can add an option to set the input file path directly, and then you can compare the results.
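To illustrate the rewind concern, here is a minimal Python sketch, not Pinot's actual code. `ForwardOnlyStream` and `two_pass_line_count` are hypothetical names made up for this example; the stream class only stands in for a remote source that cannot seek. Copying it to a local temp file is what makes the second pass possible.

```python
import io
import shutil
import tempfile


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a remote stream that cannot be rewound."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        # Fill the caller's buffer with the next chunk; 0 signals EOF.
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


def two_pass_line_count(remote):
    """Copy the stream to a local temp file, then read that copy twice."""
    with tempfile.TemporaryFile() as local:
        shutil.copyfileobj(remote, local)
        local.seek(0)
        first = sum(1 for _ in local)   # first pass over the data
        local.seek(0)                   # rewinding works on the local copy
        second = sum(1 for _ in local)  # second pass over the same data
    return first, second
```

Calling `seek(0)` on the forward-only stream itself raises `io.UnsupportedOperation`, which is the situation the local copy avoids.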
11/30/2020, 10:09 PM
OK, thanks. So in my situation, where I’m running this LaunchDataIngestionJob to pull lots of big files from S3 into a k8s-based cluster running in AWS, what’s going to be most efficient currently? I guess I could spin up another beefy EC2 instance and run the command from that server instead of from my (home office) laptop.