# pinot-perf-tuning
Hi all, I am trying to run ingestion on EMR using spark-submit. I need to ingest around 9,000 Parquet files of roughly 100 MB each, and I have set up a staging folder on S3, with jobType: SegmentCreationAndMetaDataPush and segmentCreationJobParallelism = 3. The issue is that the job runs for a very long time (more than 6 hours) without writing any segments to either the staging or the output folder.

Question 1: Does it usually take longer to ingest Parquet files? I am using a 10-node m5d.12xlarge cluster.
Question 2: How do I set up a consistent logging channel for EMR?

Attaching the IngestionSpecFile for more reference; EMR specs are listed below.
EMR: 1 m5d.12xlarge (primary) + 9 m5d.12xlarge (core)
48 vCores, 192 GiB memory, 1800 GB SSD storage per node
emr-7.1.0
Installed apps: Hadoop 3.3.6, Hive 3.1.3, JupyterEnterpriseGateway 2.6.0, Livy 0.8.0, Spark 3.5.0
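(The attached spec file isn't shown in the thread. For reference, a minimal Spark ingestion spec matching the setup described above might look roughly like the sketch below — all s3:// paths, the region, the table name, and the controller URI are hypothetical placeholders.)

```yaml
# Minimal sketch of a Spark batch ingestion spec; paths and names are hypothetical.
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentGenerationJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark3.SparkSegmentMetadataPushJobRunner'
  extraConfigs:
    stagingDir: 's3://my-bucket/pinot-staging/'   # the staging folder mentioned above
jobType: SegmentCreationAndMetadataPush
inputDirURI: 's3://my-bucket/parquet-input/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://my-bucket/pinot-output/'
segmentCreationJobParallelism: 3
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
pinotFSSpecs:
  - scheme: s3
    className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
    configs:
      region: 'us-east-1'                          # hypothetical
tableSpec:
  tableName: 'myTable'                             # hypothetical
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'  # hypothetical
```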
Pinot batch ingestion maps one Parquet file to one segment, so in your case it will produce 9,000 segments. For your EMR job, all the segments are created first and then pushed, so you won't see any data in Pinot until all 9,000 segments are created.
Can you check the output directory and see how many segments have been created?
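(Batch-created segments are written as `.tar.gz` tarballs, so one quick way to count them from the EMR primary node is the snippet below — the bucket and prefix are hypothetical:)

```sh
# Count segment tarballs written so far; bucket/prefix are hypothetical
aws s3 ls s3://my-bucket/pinot-staging/ --recursive | grep -c '\.tar\.gz$'
```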
I would suggest adding more, smaller workers for this EMR job.
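(A sketch of what that sizing could look like with spark-submit — executor counts, versions, and paths are illustrative only, not taken from the thread:)

```sh
# Illustrative sizing: many smaller executors so segment-creation tasks
# spread across the cluster. All paths, versions, and numbers are hypothetical.
# (Pinot plugin jar / extraClassPath settings omitted for brevity.)
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode client \
  --num-executors 90 \
  --executor-cores 4 \
  --executor-memory 16g \
  /home/hadoop/pinot/lib/pinot-all-1.1.0-jar-with-dependencies.jar \
  -jobSpecFile /home/hadoop/ingestionSpec.yaml
```

Separately, as far as I understand the Spark runner uses segmentCreationJobParallelism to partition the segment-creation stage, so a value of 3 would build only about three segments at a time regardless of cluster size; with 9,000 files, raising it toward the cluster's total executor-core count would likely shorten the job considerably.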
Hi @Xiang Fu, thank you for your input 🙂 I was able to load the data. It turns out enableDefaultStarTree=True was consuming a lot of EMR resources, which was causing the ingestion to fail.
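(For anyone hitting the same issue: enableDefaultStarTree lives in the offline table's tableIndexConfig, and when it is true a default star-tree index is built for every segment during generation, which adds CPU and memory cost to the job. A minimal excerpt of the relevant section — the rest of the table config is omitted:)

```json
{
  "tableIndexConfig": {
    "enableDefaultStarTree": false
  }
}
```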