# general
a
Is there anything I can do to make batch import faster? It seems like most of the time is spent processing the Parquet files I'm importing, but I still don't see very high CPU usage on my machine (particularly, most cores are not busy). I see stuff like this in the logs:
```
Apr 14, 2021 3:16:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: time spent so far 0% reading (1854 ms) and 99% processing (311813 ms)
```
Is there a setting to use more cores to process segments in parallel or anything like that?
d
What about your disk IO?
a
Looking at some system stats, disk I/O seems really low: writes on the order of 100 MB/sec, reads on the order of 8 MB/sec
d
What kind of disks are we talking about? To some extent, 100 MB/sec could be a bottleneck
a
Looking into that now! Good call
This is an NVMe under a virtualization layer
k
There’s `segmentCreationJobParallelism` in the job YAML file that should be set to the number of cores you’ve got. Though depending on your table definition (e.g. is `createInvertedIndexDuringSegmentGeneration` set to true), you might run out of memory if your parallelism is too high.
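For reference, a standalone batch ingestion job spec with that setting might look roughly like the sketch below. The bucket, paths, table name, region, and controller URL are placeholders (not taken from this thread), so swap in your own:

```yaml
# Sketch of a standalone batch ingestion job spec -- placeholder names/paths, adjust to your setup.
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
jobType: SegmentCreation
inputDirURI: 's3://my-bucket/raw/'              # assumed location of the Parquet files
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://my-bucket/segments/'        # where the built segments land
overwriteOutput: true
segmentCreationJobParallelism: 24               # roughly the number of cores on the box
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'myTable'                          # placeholder table
  schemaURI: 'http://localhost:9000/tables/myTable/schema'
  tableConfigURI: 'http://localhost:9000/tables/myTable'
pinotFSSpecs:
  - scheme: s3
    className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
    configs:
      region: 'us-west-2'                       # placeholder region
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'      # placeholder controller
```

Here `segmentCreationJobParallelism` is the knob that controls how many segments get built concurrently on that one machine.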
Though the fastest way to build segments is to run the build as a Hadoop or Spark job, on a sufficiently large cluster.
a
Thanks! You're saying to parallelize by segment and run many smaller ingestion jobs? That's my next step, I suppose 🙂
k
If you’re asking about my last comment, no - Pinot comes with support for running a Hadoop map-reduce or Spark job to build segments in parallel, using your Hadoop or Spark cluster (which usually has many servers).
But for many tables, it’s OK to build on a beefy server (e.g. with 24 cores) using just the regular Pinot segment build job, provided you set the parallelism high enough.
As part of that, you want to specify using something like HDFS as the destination, so that you can then “push” the segments to the Pinot controller by sending URIs, which are then downloaded by (multiple) Pinot server processes. Versus pushing segments through the controller, which is much slower.
a
Ok, thanks! Believe it or not I don't have Hadoop or Spark set up. What do I have to do in order to push segments to the controller by sending URIs? I'm using S3 as the destination right now
k
I’d have one job file that builds the segments (results are in S3), and then a second job file that is configured to send URIs to the controller. I’m in a mtg now, but could look up the job config for that later…
a
Thanks!
k
So you first want to run the SegmentCreation job, with the output dir in S3. Then run the SegmentUriPush job with the same output dir (it’s a little confusing: that job uses the files found in the output dir to build the list of URIs to send to the controller).
This assumes your Pinot cluster is configured to be able to read files from S3 (credentials, plugins). I’m using HDFS so I haven’t had to deal with S3 credential fun & games, but you can (IIRC) put this into the config files, though that’s a security risk.
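A rough sketch of what that second job file could look like; the build spec is the same as the earlier sketch except for `jobType: SegmentCreation` (and the parallelism setting), and the bucket, table, region, and controller values are again placeholders:

```yaml
# Sketch of the SegmentUriPush job spec -- same placeholders as the earlier sketch.
# Assumed launch command (standalone): bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile segment-uri-push.yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentUriPush
outputDirURI: 's3://my-bucket/segments/'        # same output dir the creation job wrote to
pinotFSSpecs:
  - scheme: s3
    className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
    configs:
      region: 'us-west-2'                       # placeholder region
tableSpec:
  tableName: 'myTable'                          # placeholder table
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'      # placeholder controller
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```

As noted above, the Pinot servers also need the S3 filesystem plugin and credentials configured, since they download the segments directly from S3 rather than receiving them through the controller.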
a
Thanks! The S3 setup is working so I'll try this.
So why is this faster? Is SegmentUriPush slow, and is that why it's advantageous to have it run on multiple servers rather than bottleneck on the controller?
k
Pushing URIs is faster than pushing tars, since you’re only sending URIs to the controller; those get distributed to the servers, and the segments are then downloaded by N servers in parallel.
SegmentUriPush is fast
The two slow things are (a) building the segment, and (b) pushing segments (tars) to the cluster via the controller
So the fastest approach is to build the segments in a distributed environment (Hadoop or Spark), or at least to make sure you’ve got max parallelism on the one server where you’re running the segment generation job.
and then save the segments in a distributed store (like HDFS or S3), so you can just send URIs to the cluster
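If keeping two spec files around is a nuisance, there is also (IIRC) a combined job type that runs both steps back to back; treat the name below as an assumption to check against the Pinot docs:

```yaml
# Assumption: combined job type that builds segments and then pushes their URIs in one run.
# Both runner classes then need to be declared in executionFrameworkSpec.
jobType: SegmentCreationAndUriPush
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
```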
a
Thanks!
Just out of curiosity -- why is it slow to push tars via the controller? Does it entail more than just the I/O?
I'm just trying to build out my mental model of what's going on during data ingestion
And actually -- is there a way I can measure how much time my batch ingest jobs are spending pushing segments via the controller right now, to see if that's my bottleneck?
k
If you watch the output of running the job, you’ll see when it starts pushing segments.
re pushing tars - you’ve got a single process (the Controller) needing to receive all of the segments over HTTP, and then turn around and send them to the various server processes. IIRC there might also be a download from S3 to the server where the batch job is running (though that would be silly, I know 🙂).