# general
a
If data ingestion jobs take a lot of memory to create a star tree index, how can I tune that? Does maxLeafRecords affect the memory usage of the segment creation job at all?
j
Yes, but that also affects the performance gain from the star-tree
a
Do I need to tune maxLeafRecords based on the size of my dataset or is the default of 10000 a sane value?
I'm asking because I can't get SegmentCreation jobs to run without an incredible amount of GC overhead, so I'm wondering if I'm doing something wrong
j
10k is usually good
Do you know how many dimensions are included in the star-tree?
a
My dimensionsSplitOrder has 7 items, and I've got like ~20 functionColumnPairs
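Something along these lines in my table config (column and metric names here are placeholders, and the lists are trimmed):
```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["dimA", "dimB", "dimC", "dimD", "dimE", "dimF", "dimG"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["COUNT__*", "SUM__metric1", "SUM__metric2"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```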
j
I see. 7 dimensions is not many. Any high-cardinality ones?
If you observe lots of GC, increasing the memory limit might help
a
I don't think anything is super high cardinality; one of them could have maybe a few tens of thousands of values though
By increasing the memory limit you mean the java heap size a la -Xms and -Xmx?
I'm currently running with
-Xms32G -Xmx32G
And I'm also limiting the segment generation parallelism to 4
j
Hmm, that's already quite high
a
I have verbose GC logging on and I see a lot of this:
```
2021-05-12T19:10:45.404+0000: [Full GC (Ergonomics) [PSYoungGen: 5921280K->5921277K(8552960K)] [ParOldGen: 21347368K->21347368K(22369792K)] 27268648K->27268646K(30922752K), [Metaspace: 55657K->55657K(59392K)], 55.0908616 secs] [Times: user=1220.86 sys=14.76, real=55.08 secs]
2021-05-12T19:11:40.497+0000: [Full GC (Ergonomics) [PSYoungGen: 5921280K->5921277K(8552960K)] [ParOldGen: 21347368K->21347368K(22369792K)] 27268648K->27268646K(30922752K), [Metaspace: 55657K->55657K(59392K)], 52.7552240 secs] [Times: user=1260.30 sys=13.89, real=52.75 secs]
2021-05-12T19:12:33.252+0000: [Full GC (Ergonomics) [PSYoungGen: 5921280K->5921279K(8552960K)] [ParOldGen: 21347368K->21347368K(22369792K)] 27268648K->27268648K(30922752K), [Metaspace: 55657K->55657K(59392K)], 47.7370731 secs] [Times: user=1237.77 sys=9.23, real=47.74 secs]
```
j
Are you using the on-heap or off-heap mode?
a
Not sure 😬 What is that and how can I find out?
j
Do you use the spark job to create the segment?
a
No, I'm running it via the docker image
j
Oh, with the minion task?
In that case it is off-heap
Can you try further reducing the parallelism and see if the GC becomes better?
a
Not with minion either; I'm just running this on the command line
j
I see. Then maybe just reduce the parallelism and see if the GC goes down
a
Is there such a thing as a too-big segment creation job?
j
What's the size of your input file and the output segment?
a
The input is about 80 parquet files, 16 GB in total
Not sure how big the output segment is because it's never succeeded 😮
j
In that case, can you start single-threaded?
200 MB per file on average is not too large
a
Ok so I looked into this a little more -- the combined cardinality of my dimensions is 60,000,000
That is, if I multiply the cardinalities of all the dimensions together
Is that ridiculous?
j
Not too ridiculous, but chances are the star-tree won't get much compression when it aggregates out a dimension
If you can get one segment generated, we can check the segment metadata and see how many extra records were generated for the star-tree
a
Ok cool
Btw I realized I had enableDefaultStarTree enabled, so it was also building one across all dimensions; I set that to false
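So now the relevant part of my tableIndexConfig looks roughly like this (still with placeholder names, and the star-tree config trimmed as before):
```json
{
  "tableIndexConfig": {
    "enableDefaultStarTree": false,
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["dimA", "dimB", "dimC", "dimD", "dimE", "dimF", "dimG"],
        "functionColumnPairs": ["COUNT__*", "SUM__metric1", "SUM__metric2"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```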