# general
a
Is there anything I can do to make batch import faster? It seems like most of the time is spent processing the Parquet files I'm importing, but I still don't see very high CPU usage on my machine (particularly, most cores are not busy). I see stuff like this in the logs:
```
Apr 14, 2021 3:16:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: time spent so far 0% reading (1854 ms) and 99% processing (311813 ms)
```
Is there a setting to use more cores to process segments in parallel or anything like that?
d
What about your disk IO?
a
Looking at some system stats, disk I/O seems really low: writes on the order of 100 MB/sec, reads on the order of 8 MB/sec
d
What kind of disks are we talking about? To some extent, 100 MB/sec could be a bottleneck
a
Looking into that now! Good call
This is an NVMe under a virtualization layer
k
There’s `segmentCreationJobParallelism` in the job YAML file that should be set to the number of cores you’ve got. Though depending on your table definition (e.g. is `createInvertedIndexDuringSegmentGeneration` set to true), you might run out of memory if your parallelism is too high.
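For reference, a standalone batch ingestion job spec with that setting might look roughly like the sketch below. The bucket, paths, table name, region, and controller URL are placeholders (not taken from this thread), so swap in your own:

```yaml
# Sketch of a standalone batch ingestion job spec -- placeholder names/paths, adjust to your setup.
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
jobType: SegmentCreation
inputDirURI: 's3://my-bucket/raw/'              # assumed location of the Parquet files
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://my-bucket/segments/'        # where the built segments land
overwriteOutput: true
segmentCreationJobParallelism: 24               # roughly the number of cores on the box
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'myTable'                          # placeholder table
  schemaURI: 'http://localhost:9000/tables/myTable/schema'
  tableConfigURI: 'http://localhost:9000/tables/myTable'
pinotFSSpecs:
  - scheme: s3
    className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
    configs:
      region: 'us-west-2'                       # placeholder region
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'      # placeholder controller
```

Here `segmentCreationJobParallelism` is the knob that controls how many segments get built concurrently on that one machine.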
Though the fastest way to build segments is to run the build as a Hadoop or Spark job, on a sufficiently large cluster.
a
Thanks! You're saying to parallelize by segment and run many smaller ingestion jobs? That's my next step, I suppose 🙂
k
If you’re asking about my last comment, no - Pinot comes with support for running a Hadoop map-reduce or Spark job to build segments in parallel, using your Hadoop or Spark cluster (which usually has many servers).
But for many tables, it’s OK to build on a beefy server (e.g. with 24 cores) using just the regular Pinot segment build job, provided you set the parallelism high enough.
As part of that, you want to specify using something like HDFS as the destination, so that you can then “push” the segments to the Pinot controller by sending URIs, which are then downloaded by (multiple) Pinot server processes. Versus pushing segments through the controller, which is much slower.
a
Ok, thanks! Believe it or not I don't have Hadoop or Spark set up. What do I have to do in order to push segments to the controller by sending URIs? I'm using S3 as the destination right now
k
I’d have one job file that builds the segments (results are in S3), and then a second job file that is configured to send URIs to the controller. I’m in a mtg now, but could look up the job config for that later…
a
Thanks!
k
So you first want to run the SegmentCreation job, with the output dir in S3. Then run the SegmentUriPush job with the same output dir (it’s a little confusing: that job uses the files found in the output dir to build the list of URIs to send to the controller).
This assumes your Pinot cluster is configured to be able to read files from S3 (credentials, plugins). I’m using HDFS so I haven’t had to deal with S3 credential fun & games, but you can (IIRC) put this into the config files, though that’s a security risk.
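A rough sketch of what that second job file could look like; the build spec is the same as the earlier sketch except for `jobType: SegmentCreation` (and the parallelism setting), and the bucket, table, region, and controller values are again placeholders:

```yaml
# Sketch of the SegmentUriPush job spec -- same placeholders as the earlier sketch.
# Assumed launch command (standalone): bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile segment-uri-push.yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentUriPush
outputDirURI: 's3://my-bucket/segments/'        # same output dir the creation job wrote to
pinotFSSpecs:
  - scheme: s3
    className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
    configs:
      region: 'us-west-2'                       # placeholder region
tableSpec:
  tableName: 'myTable'                          # placeholder table
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'      # placeholder controller
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```

As noted above, the Pinot servers also need the S3 filesystem plugin and credentials configured, since they download the segments directly from S3 rather than receiving them through the controller.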
a
Thanks! The S3 setup is working so I'll try this.
So why is this faster? Is SegmentUriPush slow, and is that why it's advantageous to have it run on multiple servers rather than bottleneck on the controller?
k
Pushing URIs is faster than pushing tars, since you’re only sending URIs to the controller; those get distributed to the servers, and the segments are then downloaded by N servers in parallel.
SegmentUriPush is fast
The two slow things are (a) building the segment, and (b) pushing segments (tars) to the cluster via the controller
So the fastest approach is to build the segments in a distributed environment (Hadoop or Spark), or at least to make sure you’ve got max parallelism on the one server where you’re running the segment generation job.
and then save the segments in a distributed store (like HDFS or S3), so you can just send URIs to the cluster
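If keeping two spec files around is a nuisance, there is also (IIRC) a combined job type that runs both steps back to back; treat the name below as an assumption to check against the Pinot docs:

```yaml
# Assumption: combined job type that builds segments and then pushes their URIs in one run.
# Both runner classes then need to be declared in executionFrameworkSpec.
jobType: SegmentCreationAndUriPush
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
```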
a
Thanks!
Just out of curiosity -- why is it slow to push tars via the controller? Does it entail more than just the I/O?
I'm just trying to build out my mental model of what's going on during data ingestion
And actually -- is there a way I can measure how much time my batch ingest jobs are spending pushing segments via the controller right now, to see if that's my bottleneck?
k
If you watch the output of running the job, you’ll see when it starts pushing segments.
re pushing tars - you’ve got a single process (the Controller) needing to receive all of the segments over HTTP, and then turn around and send them to the various server processes. IIRC there might also be a download from S3 to the server where the batch job is running (though that would be silly, I know 🙂).