# general
k
Question about batch import job. When running a LaunchDataIngestionJob, I see that the S3-based file(s) being ingested are first copied to a temp directory on my local machine. Assuming I’ve set up a k8s-based cluster via EKS, is there a way to ingest directly from S3? I seem to recall some option to do this, which would be much more efficient.
x
The main motivation for that is that Pinot will read the file twice and rewind the input stream. That can cause issues when we have no knowledge about the upstream source, so we always copy the file to local disk first. It’s also questionable whether reading directly from remote storage would actually be more efficient.
We can add an option to allow setting the input file path directly, so you can compare the results.
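For reference, a minimal sketch of a standalone ingestion job spec that reads its input from S3 (the bucket, region, table name, and controller URI below are hypothetical placeholders). Note that, per the behavior described above, Pinot still stages each input file in a local temp directory during segment generation even when inputDirURI points at S3.

```yaml
# Hypothetical ingestion job spec for LaunchDataIngestionJob (standalone execution).
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/raw/'            # placeholder bucket/prefix
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://my-bucket/segments/'      # placeholder output location
overwriteOutput: true
pinotFSSpecs:
  # S3 filesystem plugin; credentials are typically resolved from the environment or IAM role.
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'                     # placeholder region
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'myTable'                        # placeholder table
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'   # placeholder controller endpoint
```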
k
OK, thanks. So in my situation, where I’m running this LaunchDataIngestionJob to pull lots of big files from S3 into a k8s-based cluster running in AWS, what’s currently going to be most efficient? I guess I could spin up another beefy EC2 instance and run the command from that server instead of from my (home office) laptop.
x
Hmm, how many segments do you have? One thing you can try is to start a Pinot ingestion job as a k8s batch job, so you can allocate resources to the container. Here is one example: https://github.com/fx19880617/pinot-meetup-demo/blob/master/covid19/covid19-recovered-global-ingestion-job.yaml#L126
Typically we want to avoid copying S3 data to your local laptop. It's a good idea to have an EC2 instance and run the command from there.
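A minimal sketch of the in-cluster approach, modeled loosely on the linked example. The image tag, resource sizes, and the ConfigMap/volume names are assumptions, and it assumes the apachepinot/pinot image uses pinot-admin.sh as its entrypoint, so the container args are just the pinot-admin subcommand.

```yaml
# Hypothetical Kubernetes batch Job that runs LaunchDataIngestionJob inside the EKS cluster,
# so the data never passes through a laptop outside AWS.
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-s3-ingestion-job
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pinot-ingestion
          image: apachepinot/pinot:latest        # pin a specific tag in practice
          args:
            - LaunchDataIngestionJob
            - -jobSpecFile
            - /var/pinot/jobspec/ingestion-job-spec.yaml
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              memory: 8Gi
          volumeMounts:
            - name: jobspec
              mountPath: /var/pinot/jobspec
      volumes:
        - name: jobspec
          configMap:
            name: pinot-ingestion-job-spec       # ConfigMap holding the ingestion job spec YAML
```

Sizing the container's CPU and memory requests to match the input volume keeps segment generation off the laptop and close to both S3 and the Pinot controller.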
k
Yes, exactly.