Hi all,
I am trying to run batch ingestion on EMR via spark-submit. I need to ingest around 9,000 Parquet files of roughly 100 MB each. I have set up a staging folder on S3 and am running with jobType: SegmentCreationAndMetaDataPush and segmentCreationJobParallelism = 3.
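For reference, the submit command looks roughly like the sketch below. This is not my exact command: paths, bucket names, and the Pinot version are placeholders; the main class and the -jobSpecFile flag follow the standard Pinot Spark batch ingestion setup.

```bash
# Sketch of the spark-submit invocation (placeholders throughout).
# LaunchDataIngestionJobCommand reads the ingestion spec and runs
# segment creation + metadata push as a Spark job on YARN.
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode cluster \
  --files ingestionJobSpec.yaml \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/opt/pinot/plugins" \
  local:///opt/pinot/lib/pinot-all-1.1.0-jar-with-dependencies.jar \
  -jobSpecFile ingestionJobSpec.yaml
```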
The issue is that the job runs for a very long time (more than 6 hours) without writing out any segments, neither to the staging folder nor to the output folder.
Question 1: Does it usually take this long to ingest Parquet files? I am running on a 10-node cluster of m5d.12xlarge instances.
Question 2: How do I set up a consistent logging channel for EMR?
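(For context on Question 2: the only thing I know to do today is pull YARN container logs by hand, roughly as below; the application ID and log bucket are placeholders. I am hoping there is a more consistent setup.)

```bash
# From the EMR primary node: dump all YARN container logs for the Spark app
# (application ID is a placeholder; find it via `yarn application -list`).
yarn logs -applicationId application_1710000000000_0001 > ingestion.log

# If the cluster was created with a LogUri, EMR also ships container logs
# to S3 after a few minutes (bucket and cluster ID are placeholders):
aws s3 ls s3://my-emr-logs/j-ABCDEFGHIJKLM/containers/
```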
Attaching the ingestion spec file for more reference; the EMR specs are listed below.
EMR
1 × m5d.12xlarge (primary) + 9 × m5d.12xlarge (core)
48 vCores, 192 GiB memory, 1,800 GB SSD storage per node
emr-7.1.0
Installed apps:
Hadoop 3.3.6, Hive 3.1.3, JupyterEnterpriseGateway 2.6.0, Livy 0.8.0, Spark 3.5.0