Hi team,
I'm trying to run an ingestion job on EMR for about 87k Parquet files (4.4 TiB total) in a single S3 folder, but the job fails with a timeout during the S3 listing (error excerpt below; see the attachment for the detailed errors). I believe the compute resources assigned to the Pinot cluster and to the EMR cluster (used for the spark-submit ingestion job) are adequate, and the same ingestion job works fine on smaller datasets.
24/09/25 14:59:55 INFO S3PinotFS: Listed 8000 files from URI: s3://location/8000_files/, is recursive: true
24/09/25 15:01:33 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
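For what it's worth, the 100000 ms matches Spark's default `spark.yarn.am.waitTime` (100 s), the time the YARN ApplicationMaster waits in cluster mode for the SparkContext to be initialized, so my guess is that the long S3 listing in the driver is blowing past it. A minimal sketch of the override I'm considering (the jar path, class invocation, and job-spec path below are placeholders from my setup):

```sh
# Sketch only: raise the YARN AM wait time in case the driver's long S3
# listing is what exceeds the 100 s default. spark.yarn.am.waitTime is
# only consulted in cluster deploy mode.
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=600s \
  local:///opt/pinot/lib/pinot-all-jar-with-dependencies.jar \
  -jobSpecFile /path/to/sparkIngestionJobSpec.yaml
```

Even if raising the timeout unblocks the job, I'd still like to understand whether listing ~87k files this way is expected to work, hence the questions below.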
This looks similar to the issue @somanath joglekar hit last September, and I have the same questions he asked then:
1. Is there a memory limit when listing files for ingestion from S3?
2. Is there a limit on the number of files, or on their total size, when ingesting data?
Can anyone help with these questions?