# troubleshooting
v
Hi everyone, my team and I have set up a Pinot cluster and started testing it out. To speed up queries we configured a star-tree index. I am not seeing as much of a performance improvement for the percentileest and percentiletdigest functions as for avg or sum, which perform much better. I was wondering how the star tree actually stores percentileest: does it store the whole data structure and then query it for the specific percentile (for example, the 95th), or does it do something else? I'd appreciate any help or advice, or a pointer to documentation on this that I missed.
m
When you say you don't see much improvement, what's the baseline? Also, are these queries the same except for the aggregation function, or do they have different filters/grouping?
v
The queries are the same, with the same grouping and the same filtering. The avg function takes around 100 ms, whereas percentileest takes around 3000 to 4000 ms.
m
What about tdigest?
Also, is the star tree configured to include tdigest?
v
Around the same
image.png
This is the star tree configuration
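The screenshot is not preserved in this log. For readers following along, a star-tree config that pre-aggregates percentile functions generally has the shape sketched below. This is only an illustration: the column names (`region`, `service`, `latency`) are hypothetical placeholders and not the values from the screenshot, and the exact spelling of the percentile function names in `functionColumnPairs` should be checked against the Pinot version in use.

```python
import json

# Hypothetical star-tree config; dimension and metric names are placeholders.
star_tree_config = {
    # Dimensions the tree is split on, typically highest cardinality first.
    "dimensionsSplitOrder": ["region", "service"],
    "skipStarNodeCreationForDimensions": [],
    # Each pair is <FUNCTION>__<column>; only listed pairs are pre-aggregated.
    "functionColumnPairs": [
        "AVG__latency",
        "PERCENTILE_EST__latency",
        "PERCENTILE_TDIGEST__latency",
    ],
    # 100k is the value mentioned later in this thread.
    "maxLeafRecords": 100000,
}

# The config lives under tableIndexConfig.starTreeIndexConfigs in the table config.
table_index_config = {"starTreeIndexConfigs": [star_tree_config]}

print(json.dumps(table_index_config, indent=2))
```

A query can only be served from the star tree if its aggregation appears in `functionColumnPairs` and its filter and group-by columns appear in `dimensionsSplitOrder`.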
m
Ok. Trying to understand: are you comparing the latency of avg vs tdigest? That is not a fair comparison. To measure the speedup, you want to compare the same query with and without the star tree.
Can you share the response metadata of the tdigest query?
The star-tree index helps when you have low-selectivity queries, with millions of records selected by the query.
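A minimal sketch of that comparison, assuming the `useStarTree` query option is available in the Pinot version in use. The table name `metrics` and the columns `latency` and `ts` are hypothetical; the metadata keys are the ones Pinot's broker response normally carries.

```python
def with_and_without_star_tree(sql: str) -> dict:
    """Return the same query twice: as-is, and with the star tree disabled."""
    base = sql.strip().rstrip(";")
    return {
        "star_tree": base,
        # Query options are set with SET statements ahead of the query.
        "no_star_tree": "SET useStarTree = false; " + base,
    }


def summarize(response: dict) -> dict:
    """Pick the response-metadata fields worth comparing between the two runs."""
    keys = (
        "timeUsedMs",
        "numDocsScanned",
        "numEntriesScannedInFilter",
        "numEntriesScannedPostFilter",
        "totalDocs",
    )
    return {key: response.get(key) for key in keys}


# Hypothetical query; table and column names are placeholders.
pair = with_and_without_star_tree(
    "SELECT PERCENTILETDIGEST(latency, 95) FROM metrics WHERE ts > 0"
)
print(pair["no_star_tree"])
```

If the star tree is being used, `numDocsScanned` should drop sharply in the `star_tree` run, because pre-aggregated records are read instead of raw rows.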
v
Yes, true, I am comparing the latency of avg vs tdigest with the star-tree index in place. Still, with the star-tree index the avg function gained more performance than the tdigest function.
image.png
So there are around 56 million records being scanned, and I have tuned the max leaf records to 100k.
That said, the performance of tdigest did improve after the star-tree index was added, but 4000 ms still seemed like a lot.
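One plausible reason avg gains more than tdigest: a pre-aggregated avg is just a (sum, count) pair, whereas a pre-aggregated percentile has to be a mergeable summary of the whole distribution, and as far as I know Pinot keeps a serialized sketch per pre-aggregated record and merges those at query time. The toy below illustrates the difference in work. It uses a fixed-bucket histogram as a stand-in for a TDigest; it is not Pinot's implementation, and `BUCKET_WIDTH` is an arbitrary assumption.

```python
from collections import Counter

BUCKET_WIDTH = 10  # assumed bucket width for the toy summary


def summarize(values):
    """Build a mergeable summary: counts per bucket (stand-in for a sketch)."""
    return Counter((v // BUCKET_WIDTH) * BUCKET_WIDTH for v in values)


def merge(a, b):
    """Merging two percentile summaries touches every bucket in both."""
    return a + b


def percentile(summary, p):
    """Approximate the p-th percentile as the lower edge of the matching bucket."""
    total = sum(summary.values())
    if total == 0:
        raise ValueError("empty summary")
    target = p / 100 * total
    seen = 0
    for bucket in sorted(summary):
        seen += summary[bucket]
        if seen >= target:
            return bucket
    return max(summary)


def merge_avg(a, b):
    """Merging two pre-aggregated averages is two additions on (sum, count)."""
    return (a[0] + b[0], a[1] + b[1])


# Two pre-aggregated records, as a star tree might hold for two leaf groups.
left, right = [1, 12, 25], [31, 47, 95]
merged = merge(summarize(left), summarize(right))
avg_state = merge_avg((sum(left), len(left)), (sum(right), len(right)))
```

The percentile is only extracted after all summaries are merged, so the per-record cost is dominated by deserializing and merging the summaries rather than by the final lookup.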
m
Is there a group by?
v
Yes, it groups by hour.
m
So one possibility is that you have too many groups, and the grouping is performed outside of the star-tree index. Can you drop the group by and check the latency?
v
Okay, so the latency is now around 1600 ms.
I guess the group by was adding some more latency
m
Yeah, so the rest is outside of the star tree.
v
Thank you for your help, I will experiment with this some more : )
m
1600 ms still seems too high, though. Maybe the 100k max leaf records setting is preventing the pre-aggregation, or high dimensionality is contributing to it. Do you have the latency for the same query without the star tree?
For the group by, do you use any UDF for time conversion? If so, that may be slowing it down. Using a derived column to store that granularity will help.
v
We store the time as epoch ms, and do not use any UDFs apart from the functions documented in the Apache Pinot docs. But I think that using a derived column would help, and I will look into it some more.
p
Using a derived column brought down the latency 10x for group-by queries for me. So I would definitely suggest giving that a try.
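A rough sketch of what such a derived column could look like for the hourly grouping discussed above, assuming the source column holds epoch milliseconds. The column names `ts` and `hours_since_epoch` are hypothetical; `toEpochHours` is one of Pinot's built-in date-time transform functions.

```python
import json

MS_PER_HOUR = 3_600_000


def to_epoch_hours(epoch_ms: int) -> int:
    """What the derived column stores: whole hours since the epoch."""
    return epoch_ms // MS_PER_HOUR


# Ingestion-time transform (table config): computed once per row at ingestion,
# rather than once per scanned row at query time.
ingestion_config = {
    "ingestionConfig": {
        "transformConfigs": [
            {
                "columnName": "hours_since_epoch",  # hypothetical name
                "transformFunction": "toEpochHours(ts)",  # ts is hypothetical
            }
        ]
    }
}

# The derived column also has to be declared in the schema.
schema_field = {"name": "hours_since_epoch", "dataType": "LONG"}

print(json.dumps(ingestion_config, indent=2))
```

Grouping by `hours_since_epoch` then becomes a plain column group-by, and adding the column to the star tree's `dimensionsSplitOrder` should let the grouping be served from the pre-aggregated records as well.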
v
@Priyank Bagrecha thank you, I will definitely try it out.