Hi ! We are evaluating Pinot (`7c15bc`) to replace...
# troubleshooting
t
Hi ! We are evaluating Pinot (
7c15bc
) to replace trino for low latency geospatial aggregation queries. We have a table with about
10^8
records, using the star tree indexes we are able to perform our queries orders of magnitude faster than trino 🚀. But when using geospatial filtering,
ST_Contains
we experience performance degradation (from <200ms to over 10 seconds) even if we are using H3 geospatial indexes with different resolutions (which are triggered as expected) and the geometres are not complex. Our queries are pretty straightforward:
Copy code
select sum(x) from <table> where ST_Contains(ST_GeogFromText('...'),location_st_point) = 1 and ...
Is there any way we can improve latencies for geospatial aggregations ? Should we just avoid geospatial filters with this number of records ?
👍 1
m
@Yupeng Fu
Can you also share the query and response metadata
t
Sure, thank you for the quick response 😄 query
Copy code
select count(*) from xxxx_OFFLINE where ST_Contains(ST_GeogFromText('POLYGON((10.121899631484375 43.78360214117981,12.992289079218752 44.4190176976316,14.654093496328125 42.34891239577081,12.582233443828127 41.157530371951424,10.121899631484375 43.78360214117981))'),location_st_point) = 1
metadata
Copy code
timeUsedMs: 12101
numDocsScanned: 6476678
totalDocs: 71106843
numServersQueried: 3
numServersResponded: 3
numSegmentsQueried: 5
numSegmentsProcessed: 5
numSegmentsMatched: 5
numConsumingSegmentsQueried: 0
numEntriesScannedInFilter: 9567118
plan
Copy code
BROKER_REDUCE(limit:10)	1	0
COMBINE_AGGREGATE	2	1
PLAN_START(numSegmentsForThisPlan:5)	-1	-1
AGGREGATE(aggregations:count(*))	3	2
TRANSFORM_PASSTHROUGH()	4	3
PROJECT()	5	4
DOC_ID_SET	6	5
INCLUSION_FILTER_H3_INDEX(inclusionIndex:h3_index,operator:EQ,predicate:stcontains('8400000001000000050000000040243e69a0e53d5c4045e44d1330b8b04029fc0d506ed851404635a25f35e2f0402d4ee557be82dd40452ca92950809840292a1a807fc22e40449429f489d43c40243e69a0e53d5c4045e44d1330b8b0',location_st_point) = '1')
m
Yeah, the query does seem expensive. Any ideas on how to further optimize it @Yupeng Fu, @Jackie?
y
st_contain can be expensive, depending on the polygon size and complexity
try to use h3index, and set the resolution accordingly
t
Thank you, we are going to try different resolutions for the H3 index. Ideally should we size it to trigger 1 H3 partition per query ? Can we help the query planner to prune docs by providing additional filters before the ST_Contains evaluation ?
k
Yes, you can try that