# general
a
If data ingestion jobs take a lot of memory to create a star tree index, how can I tune that? Does maxLeafRecords affect the memory usage of the segment creation job at all?
j
Yes, but that also affects the performance gain from the star-tree
a
Do I need to tune maxLeafRecords based on the size of my dataset or is the default of 10000 a sane value?
I'm asking because I can't get SegmentCreation jobs to run without an incredible amount of GC overhead, so I'm wondering if I'm doing something wrong
j
10k is usually good
Do you know how many dimensions are included in the star-tree?
a
My dimensionsSplitOrder has 7 items, and I've got like ~20 functionColumnPairs
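Something along these lines in my table config (column and metric names here are placeholders, and the lists are trimmed):
```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["dimA", "dimB", "dimC", "dimD", "dimE", "dimF", "dimG"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["COUNT__*", "SUM__metric1", "SUM__metric2"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```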
j
I see. 7 dimensions is not many. Any high-cardinality ones?
If you observe lots of GC, increasing the memory limit might help
a
I don't think anything is super high cardinality; one of them could have maybe a few tens of thousands of values though
By increasing the memory limit you mean the java heap size a la -Xms and -Xmx?
I'm currently running with
-Xms32G -Xmx32G
And I'm also limiting the segment generation parallelism to 4
j
Hmm, that's already quite high
a
I have verbose GC logging on and I see a lot of this:
```
2021-05-12T19:10:45.404+0000: [Full GC (Ergonomics) [PSYoungGen: 5921280K->5921277K(8552960K)] [ParOldGen: 21347368K->21347368K(22369792K)] 27268648K->27268646K(30922752K), [Metaspace: 55657K->55657K(59392K)], 55.0908616 secs] [Times: user=1220.86 sys=14.76, real=55.08 secs]
2021-05-12T19:11:40.497+0000: [Full GC (Ergonomics) [PSYoungGen: 5921280K->5921277K(8552960K)] [ParOldGen: 21347368K->21347368K(22369792K)] 27268648K->27268646K(30922752K), [Metaspace: 55657K->55657K(59392K)], 52.7552240 secs] [Times: user=1260.30 sys=13.89, real=52.75 secs]
2021-05-12T19:12:33.252+0000: [Full GC (Ergonomics) [PSYoungGen: 5921280K->5921279K(8552960K)] [ParOldGen: 21347368K->21347368K(22369792K)] 27268648K->27268648K(30922752K), [Metaspace: 55657K->55657K(59392K)], 47.7370731 secs] [Times: user=1237.77 sys=9.23, real=47.74 secs]
```
j
Are you using the on-heap or off-heap mode?
a
Not sure 😬 What is that and how can I find out?
j
Do you use the spark job to create the segment?
a
No, I'm running it via the docker image
j
Oh, with the minion task?
In that case it is off-heap
Can you try further reducing the parallelism and see if the GC becomes better?
a
Not with minion either; I'm just running this on the command line
j
I see. Then maybe just reduce the parallelism and see if the GC goes down
a
Is there such a thing as a too-big segment creation job?
j
What's the size of your input file and the output segment?
a
The input is about 80 parquet files, 16 GB in total
Not sure how big the output segment is because it's never succeeded 😮
j
In that case, can you start single-threaded?
200 MB per file on average is not too large
a
Ok so I looked into this a little more -- the combined cardinality of my dimensions is 60,000,000
That is, if I multiply the cardinalities of all the dimensions together
Is that ridiculous?
j
Not too ridiculous, but chances are the star-tree won't get much compression when it aggregates out a dimension
If you can get one segment generated, we can check the segment metadata and see how many extra records were generated for the star-tree
a
Ok cool
Btw I realized I had enableDefaultStarTree enabled, so it was also building one across all dimensions; I set that to false
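So now the relevant part of my tableIndexConfig looks roughly like this (still with placeholder names, and the star-tree config trimmed as before):
```json
{
  "tableIndexConfig": {
    "enableDefaultStarTree": false,
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["dimA", "dimB", "dimC", "dimD", "dimE", "dimF", "dimG"],
        "functionColumnPairs": ["COUNT__*", "SUM__metric1", "SUM__metric2"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```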