Hi everyone Me and my team have set up a Pinot cluster and s Apache Pinot #troubleshooting


Visar Buza

06/16/2022, 7:59 AM

Hi everyone, my team and I have set up a Pinot cluster and started testing it out. To speed up the queries we configured a star-tree index. I am not seeing as much of a performance improvement for the percentileest and percentiletdigest functions as for avg or sum; the latter perform much better. I was wondering how the star tree actually stores percentileest: does it store the whole data structure and then query it for the specific percentile (for example, the 95th), or does it do something else? I'd appreciate any help or advice, or a pointer to documentation that I missed.
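To make the question concrete, here is a toy sketch of why a pre-aggregated percentile can cost more than a pre-aggregated sum or average. It assumes each pre-aggregated record keeps a mergeable summary of its values and that the query merges those summaries before reading off the percentile. The bucketed histogram below is only a stand-in for a real quantile sketch, and `BIN_WIDTH`, `make_sketch`, `merge_sketches`, and `percentile` are hypothetical names, not Pinot APIs.

```python
from collections import Counter

BIN_WIDTH = 10  # hypothetical bucket width for the toy sketch


def make_sketch(values):
    """Toy stand-in for a quantile sketch: counts of values per bucket."""
    return Counter((v // BIN_WIDTH) * BIN_WIDTH for v in values)


def merge_sketches(sketches):
    """Merge per-record sketches; the work grows with the sketch size."""
    merged = Counter()
    for sketch in sketches:
        merged.update(sketch)
    return merged


def percentile(sketch, p):
    """Lower bound of the bucket that contains the p-th percentile."""
    target = p / 100 * sum(sketch.values())
    seen = 0
    for bucket in sorted(sketch):
        seen += sketch[bucket]
        if seen >= target:
            return bucket
    raise ValueError("empty sketch")


# Raw values behind two hypothetical pre-aggregated records.
rec_a = [12, 15, 31, 47]
rec_b = [5, 18, 93, 220]

# AVG: each record collapses to two numbers (sum, count), so merging is trivial.
avg = (sum(rec_a) + sum(rec_b)) / (len(rec_a) + len(rec_b))

# Percentile: each record carries a whole sketch that must be merged first.
p95 = percentile(merge_sketches([make_sketch(rec_a), make_sketch(rec_b)]), 95)
```

Under that assumption, the per-record work for a percentile scales with the size of the sketch rather than being a single addition, which would be consistent with avg and sum benefiting more from the same index.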

Mayank

06/16/2022, 8:03 AM

When you say you don't see much improvement, what's the baseline? Also, are these queries the same except for the aggregation function, or do they have different filters/grouping?

Visar Buza

06/16/2022, 8:11 AM

The queries are the same, with the same grouping and the same filtering. The avg function takes around 100 ms, whereas percentileest takes around 3000 ms to 4000 ms.

Mayank

06/16/2022, 8:11 AM

What about tdigest?

Mayank

06/16/2022, 8:12 AM

Also, are you configuring the star tree to include tdigest?

Visar Buza

06/16/2022, 8:12 AM

Around the same

Visar Buza

06/16/2022, 8:12 AM

[Attachment: image.png, a screenshot of the star-tree configuration]

Visar Buza

06/16/2022, 8:13 AM

This is the star tree configuration
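The screenshot itself is not reproduced here. As a rough illustration only, a star-tree configuration covering both avg and the percentile functions might look like the sketch below. The field names follow the Pinot star-tree index documentation as I understand it; the column names (`hoursSinceEpoch`, `country`, `latencyMs`) and the helper `star_tree_config` are hypothetical and not taken from this thread.

```python
import json


def star_tree_config(dimensions, metric, max_leaf_records=10000):
    """Build one star-tree index entry (all column names are hypothetical)."""
    return {
        "dimensionsSplitOrder": list(dimensions),
        "skipStarNodeCreationForDimensions": [],
        # Only aggregations listed here can be served from the star tree.
        "functionColumnPairs": [
            f"AVG__{metric}",
            f"PERCENTILE_EST__{metric}",
            f"PERCENTILE_TDIGEST__{metric}",
        ],
        "maxLeafRecords": max_leaf_records,
    }


# Fragment that would sit under "tableIndexConfig" in the table config.
table_index_config = {
    "starTreeIndexConfigs": [
        star_tree_config(["hoursSinceEpoch", "country"], "latencyMs")
    ]
}

print(json.dumps(table_index_config, indent=2))
```

If a percentile function is missing from `functionColumnPairs`, queries using it would presumably not be answered from the star tree, so that is worth checking first.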

Mayank

06/16/2022, 8:20 AM

OK, trying to understand: are you comparing the latency of avg vs. tdigest? That is not a fair comparison. To measure the speedup, you want to compare the same query with and without the star tree.
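One way to run that comparison without rebuilding segments is the `useStarTree` query option, sketched below. The table and column names are hypothetical, and whether the `SET` prefix or the older `OPTION(useStarTree=false)` suffix applies depends on the Pinot version, so treat the exact syntax as an assumption.

```python
def with_star_tree_disabled(sql):
    """Prefix a query with the option asking Pinot to skip star-tree indexes."""
    return "SET useStarTree = false; " + sql.strip()


# Hypothetical query; replace the table and column names with your own.
query = (
    "SELECT PERCENTILETDIGEST(latencyMs, 95) "
    "FROM myTable WHERE country = 'US'"
)

baseline_query = with_star_tree_disabled(query)  # same query, star tree off
star_tree_query = query                          # star tree used when applicable
```

Running both and comparing the reported time and the number of documents scanned in the response metadata gives the with/without baseline asked for here.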

Mayank

06/16/2022, 8:21 AM

Can you share the response metadata of the tdigest query?

Mayank

06/16/2022, 8:22 AM

The star-tree index helps when you have low-selectivity queries, i.e. queries whose filters select millions of records.

Visar Buza

06/16/2022, 8:23 AM

Yes, true, I am comparing the latency of avg vs. tdigest after adding the star-tree index. But even so, with the star-tree index the avg function gained more performance than the tdigest function.

Visar Buza

06/16/2022, 8:23 AM

[Attachment: image.png]

Visar Buza

06/16/2022, 8:24 AM

So there are around 56 million records being scanned, and I have tuned the max leaf records to 100k.

Visar Buza

06/16/2022, 8:26 AM

Nevertheless, the performance of tdigest did improve after the star-tree index was added, but 4000 ms still seemed like a lot.

Mayank

06/16/2022, 8:26 AM

Is there a group by?

Visar Buza

06/16/2022, 8:26 AM

Yes, it groups by hour.

Mayank

06/16/2022, 8:27 AM

So one possibility is that you have too many groups, and the grouping is performed outside of the star-tree index. Can you drop the group by and check the latency?

Visar Buza

06/16/2022, 8:29 AM

Okay, so the latency is now around 1600 ms.

Visar Buza

06/16/2022, 8:29 AM

I guess the group by was adding some more latency

Mayank

06/16/2022, 8:29 AM

Yeah, so the rest is outside of the star tree.

Visar Buza

06/16/2022, 8:30 AM

Thank you for your help, I will experiment with this some more : )

Mayank

06/16/2022, 8:41 AM

1600 ms still seems too much, though. Maybe the max leaf records of 100k is preventing the pre-aggregation, or high dimensionality is contributing to it. Do you have the latency for the same query without the star tree?

Mayank

06/16/2022, 8:42 AM

For the group by, do you use any UDF for time conversion? If so, that may be slowing it down. Using a derived column to store the time at that granularity will help.

Visar Buza

06/16/2022, 3:17 PM

We store the time as epoch ms, and do not use any UDFs apart from the functions described in the Apache Pinot docs. But I think that using a derived column would help, and I will look into it some more.
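For reference, a derived hour column can be sketched as an ingestion transform that buckets the epoch-millisecond timestamp once at ingestion time, so the group by no longer has to convert every row at query time. The column names `eventTimeMs` and `hoursSinceEpoch` are hypothetical, and the config fragment assumes Pinot's `transformConfigs` ingestion setting with the built-in `toEpochHours` function; the Python helper only mirrors the arithmetic.

```python
MS_PER_HOUR = 3_600_000


def to_epoch_hours(epoch_ms):
    """Same arithmetic the derived column would hold: whole hours since epoch."""
    return epoch_ms // MS_PER_HOUR


# Hypothetical fragment for the table config's "ingestionConfig" section.
# The derived column must also be declared in the table schema.
ingestion_config = {
    "transformConfigs": [
        {
            "columnName": "hoursSinceEpoch",
            "transformFunction": "toEpochHours(eventTimeMs)",
        }
    ]
}
```

The query would then use `GROUP BY hoursSinceEpoch`, and the derived column could also be added to the star tree's `dimensionsSplitOrder` so the hourly grouping can be served from pre-aggregated records.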

Priyank Bagrecha

06/16/2022, 3:55 PM

Using a derived column brought down the latency 10x for group-by queries for me, so I would definitely suggest giving that a try.


Visar Buza

06/17/2022, 7:09 AM

@Priyank Bagrecha thank you, I will definitely try it out.
