Aaron Wishnick

05/10/2021, 9:26 PM
I'm trying to track down where my SegmentCreation job is spending time. It's been running for a few hours, but here are the only time measurements I see:
Assembled and processed 7990100 records from 17 columns in 121766 ms: 65.618484 rec/ms, 1115.5142 cell/ms
time spent so far 0% reading (237 ms) and 99% processing (121766 ms)
at row 7990100. reading next block
block read in memory in 44 ms. row count = 1418384
Finished building StatsCollector!
Collected stats for 9408484 documents
Created dictionary for INT column: ...
...
RecordReader initialized will read a total of 9408484 records.
at row 0. reading next block
Got brand-new decompressor [.gz]
block read in memory in 133 ms. row count = 7990100
Start building IndexCreator!
Assembled and processed 7990100 records from 17 columns in 127060 ms: 62.884464 rec/ms, 1069.0359 cell/ms
time spent so far 0% reading (133 ms) and 99% processing (127060 ms)
at row 7990100. reading next block
block read in memory in 26 ms. row count = 1418384
Finished records indexing in IndexCreator!
Finished segment seal!
...
Generated 25884 star-tree records from 9408484 segment records
Finished constructing star-tree, got 1228 tree nodes and 2058 records under star-node
Finished creating aggregated documents, got 1227 aggregated records
Finished building star-tree in 276631ms
Starting building star-tree with config: StarTreeV2BuilderConfig[...]
Generated 9408484 star-tree records from 9408484 segment records
It's been stuck at that last line for a long time but still seems to be doing something. All the time measurements only add up to a few minutes. Any idea where it's spending time?

Ken Krugler

05/10/2021, 10:32 PM
There was a similar question recently in #general, where GCing could be a potential source of slowness (trying to include a ref, let’s see if this works: https://apache-pinot.slack.com/archives/CDRCA57FC/p1620224987160700)

Xiang Fu

05/10/2021, 11:39 PM
I feel it’s mostly spent on startree generation

Aaron Wishnick

05/11/2021, 1:12 PM
Awesome, thanks for the pointer. I am running with the default heap size so that could be a culprit.
How can I override the default JVM heap settings if I'm using the docker image? Is the right fix for me to override those heap settings or is that the wrong fix here?
I see pinot running the data ingestion job with -Xms4G

Ken Krugler

05/11/2021, 4:32 PM
Hi @Aaron Wishnick - sorry, I don’t know the best approach with Docker. I run the ingestion job on a bare metal server, so I just changed what was in the script used to launch it.
In bin/pinot-admin.sh, I changed one line to:
ALL_JAVA_OPTS="-Xms15G -Xmx15G -Dlog4j2.configurationFile=conf/pinot-admin-log4j2.xml"
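
For the Docker case, the official images launch through the same pinot-admin.sh, and newer versions of the script use a JAVA_OPTS environment variable in place of the default ALL_JAVA_OPTS when it is set; a rough sketch of overriding the heap that way is below. The image tag, mount paths, and job spec location are placeholders, and the JAVA_OPTS behavior is worth verifying against the script inside your image.

# Sketch: run the ingestion job from the Pinot docker image with a larger heap.
# Assumes the image's pinot-admin.sh picks up JAVA_OPTS when it is set;
# all paths below are hypothetical.
docker run --rm \
  -e JAVA_OPTS="-Xms16G -Xmx16G -Dlog4j2.configurationFile=conf/pinot-admin-log4j2.xml" \
  -v /host/path/job-spec.yaml:/tmp/job-spec.yaml \
  -v /host/path/data:/tmp/data \
  apachepinot/pinot:latest LaunchDataIngestionJob \
  -jobSpecFile /tmp/job-spec.yaml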

Aaron Wishnick

05/11/2021, 4:45 PM
Thanks
I'm no Java expert but this looks like really bad GC overhead, right?
2021-05-11T20:38:54.205+0000: [Full GC (Ergonomics) [PSYoungGen: 3348480K->2829207K(4470272K)] [ParOldGen: 11184779K->11184779K(11185152K)] 14533259K->14013987K(15655424K), [Metaspace: 55133K->55133K(1099776K)], 546.0819375 secs] [Times: user=31738.44 sys=70.66, real=546.08 secs]
2021-05-11T20:48:00.653+0000: [Full GC (Ergonomics) [PSYoungGen: 3348480K->2829208K(4470272K)] [ParOldGen: 11184779K->11184779K(11185152K)] 14533259K->14013987K(15655424K), [Metaspace: 55133K->55133K(1099776K)], 522.6226807 secs] [Times: user=28270.29 sys=112.95, real=522.62 secs]
2021-05-11T20:56:43.693+0000: [Full GC (Ergonomics) [PSYoungGen: 3348480K->2829208K(4470272K)] [ParOldGen: 11184779K->11184779K(11185152K)] 14533259K->14013987K(15655424K), [Metaspace: 55133K->55133K(1099776K)], 563.4447500 secs] [Times: user=35688.72 sys=23.60, real=563.44 secs]

Ken Krugler

05/11/2021, 9:43 PM
Yes, that’s pretty bad

Aaron Wishnick

05/12/2021, 4:08 PM
Is the fix to make the heap bigger? I'm using a 16 GB heap right now. Or are there other data ingest settings I can tune to reduce memory usage?
If it's relevant, the segment I'm ingesting is built from ~80 different parquet files. Any chance that pinot is doing something funky with parallelism, like spinning up a job per parquet file in parallel, so I've got 80 jobs? I have the default settings there.

Ken Krugler

05/12/2021, 4:10 PM
The job file should have a parallelism setting in it: segmentCreationJobParallelism
The recommendation is to set this equal to the number of cores, but if you’re processing large input files, you might need to reduce the parallelism.
Generating star-tree indexes can use significant amounts of memory, so when I added some of these to existing tables, I had to do exactly that (from 16 down to 4)
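
For reference, segmentCreationJobParallelism is a top-level field of the ingestion job spec YAML; a minimal sketch of the relevant portion, with hypothetical paths and the stock Parquet record reader, might look like this:

# Relevant slice of an ingestion job spec (paths are placeholders).
# segmentCreationJobParallelism caps how many segments are built at once;
# keeping it well below the core count limits peak heap usage when inputs
# are large or star-tree indexes are enabled.
jobType: SegmentCreationAndTarPush
inputDirURI: 'file:///path/to/parquet/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 'file:///path/to/segments/'
segmentCreationJobParallelism: 4
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'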

Aaron Wishnick

05/12/2021, 4:11 PM
Ok, thanks, I'll give that a shot
The machine I'm running on has 40 cores, and it does look like every core is maxed out right now

Ken Krugler

05/12/2021, 4:12 PM
That would do it 🙂

Aaron Wishnick

05/12/2021, 4:14 PM
😬

Ken Krugler

05/12/2021, 4:14 PM
I should update the docs to cover the above issues…

Aaron Wishnick

05/12/2021, 4:14 PM
Do you have any guidance around heap sizing? I do have plenty more memory, but I'm aware that sometimes making the heap bigger makes things worse

Ken Krugler

05/12/2021, 4:15 PM
I would leave the heap at 16gb, and reduce parallelism to say 4 (just as a test)

Aaron Wishnick

05/12/2021, 4:15 PM
That'd be helpful! I noticed that the only place segmentCreationJobParallelism is mentioned is https://docs.pinot.apache.org/users/tutorials/use-s3-and-pinot-in-docker; the pages on job specs don't mention it
Got it, let's see how that goes
Heh, as far as I can tell this is going way faster so far
I'll confirm once these jobs finish

Ken Krugler

05/12/2021, 4:19 PM
600-second full GCs tend to slow things down 🙂

Aaron Wishnick

05/12/2021, 4:25 PM
🙂
Ok, this is better but not great:
2021-05-12T16:25:48.464+0000: [Full GC (Ergonomics) [PSYoungGen: 499972K->0K(4527104K)] [ParOldGen: 10669128K->11093033K(11185152K)] 11169101K->11093033K(15712256K), [Metaspace: 55592K->55592K(1099776K)], 36.2899382 secs] [Times: user=962.56 sys=0.69, real=36.29 secs] 
2021-05-12T16:26:28.268+0000: [Full GC (Ergonomics) [PSYoungGen: 3428864K->228298K(4527104K)] [ParOldGen: 11093033K->11170454K(11185152K)] 14521897K->11398752K(15712256K), [Metaspace: 55592K->55592K(1099776K)], 51.5969673 secs] [Times: user=1290.02 sys=9.49, real=51.60 secs] 
2021-05-12T16:27:27.406+0000: [Full GC (Ergonomics) [PSYoungGen: 3428864K->496053K(4527104K)] [ParOldGen: 11170454K->11170454K(11185152K)] 14599318K->11666507K(15712256K), [Metaspace: 55592K->55592K(1099776K)], 43.8830724 secs] [Times: user=936.58 sys=13.11, real=43.88 secs]
Is the next move to increase my heap size? Or is there anything else I should do? Am I trying to process too much data, and should I break my parquet files up into smaller pieces or something? Or are there other JVM GC settings that would help with this?

Ken Krugler

05/12/2021, 7:54 PM
How big is a typical segment you’re generating? You could take one of these, unpack it, and examine the metadata file to see how big the various columns are. That would give you some indication of whether you should generate smaller input files.
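
A generated segment is a gzipped tarball with a metadata.properties file inside, so inspecting one is a couple of shell commands; a sketch with a hypothetical segment name is below (v1-format segments keep metadata.properties at the top level rather than under v3/).

# Sketch: unpack one generated segment and check per-column stats
# (cardinality, dictionary size, total docs). Segment name is a placeholder.
tar -xzf myTable_OFFLINE_0.tar.gz
less myTable_OFFLINE_0/v3/metadata.properties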

Aaron Wishnick

05/12/2021, 7:57 PM
The input is about 80 parquet files, 16 GB in total

Ken Krugler

05/12/2021, 7:57 PM
That sounds smaller than what I’d expect. But Parquet can compress data down a lot.

Xiang Fu

05/12/2021, 9:16 PM
I somehow feel the segment itself is ok; the star-tree build itself may explode the memory
Generated 25884 star-tree records from 9408484 segment records
Finished constructing star-tree, got 1228 tree nodes and 2058 records under star-node
Finished creating aggregated documents, got 1227 aggregated records
Finished building star-tree in 276631ms
Starting building star-tree with config: StarTreeV2BuilderConfig[...]
Generated 9408484 star-tree records from 9408484 segment records
Do you have the star-tree configs and the metadata from one segment?

Ken Krugler

05/12/2021, 9:23 PM
Yes, "Generated 9408484 star-tree records from 9408484 segment records" seems odd - maybe you’re trying to use a column in the star-tree definition which is unique per row? If so, that won’t help your queries.

Aaron Wishnick

05/13/2021, 2:17 PM
Nice catch
So it looks like one issue was that I had enableDefaultStarTree set to true, so it was generating a second star tree across all dimensions, and yes, one of those dimensions could have been unique
I now see:
Generated 25884 star-tree records from 9408484 segment records
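
For reference, both knobs discussed above live in the table config under tableIndexConfig; a minimal sketch with placeholder column and metric names, following the star-tree fields in Pinot's docs, might look like this (JSON allows no comments, so treat every name below as hypothetical):

"tableIndexConfig": {
  "enableDefaultStarTree": false,
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["someDimension", "anotherDimension"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__someMetric", "COUNT__*"],
      "maxLeafRecords": 10000
    }
  ]
}

Leaving any unique-per-row column out of dimensionsSplitOrder (and keeping enableDefaultStarTree off) is what avoids the one-star-tree-record-per-segment-record behavior seen earlier in the thread.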