Tommaso Peresson
06/24/2022, 9:47 AMselect date,
fields.column1,
distinctcounthll(hllState)
from EventsHll
where fields.column2 in (1,2,10)
group by date, fields.column1
limit 300
And I get sub optimal performance:
"numServersQueried": 2,
"numServersResponded": 2,
"numSegmentsQueried": 1816,
"numSegmentsProcessed": 1816,
"numSegmentsMatched": 1816,
"numConsumingSegmentsQueried": 0,
"numDocsScanned": 154861922,
"numEntriesScannedInFilter": 214829377,
"numEntriesScannedPostFilter": 464585766,
"numGroupsLimitReached": false,
"totalDocs": 447509450,
"timeUsedMs": 25832,
"offlineThreadCpuTimeNs": 0,
"realtimeThreadCpuTimeNs": 0,
"offlineSystemActivitiesCpuTimeNs": 0,
"realtimeSystemActivitiesCpuTimeNs": 0,
"offlineResponseSerializationCpuTimeNs": 0,
"realtimeResponseSerializationCpuTimeNs": 0,
"offlineTotalCpuTimeNs": 0,
"realtimeTotalCpuTimeNs": 0,
"segmentStatistics": [],
"traceInfo": {},
"minConsumingFreshnessTimeMs": 0,
"numRowsResultSet": 140
So from my understanding the best combination would be a star-tree index for the aggregation and an inverted index for the filtering. Now when I look at the query explanation it seems that it uses the star-tree index filtering
Operator#$%0 Operator_Id#$%1 Parent_Id#$%2
------------------------------------------------------------------ ------------------ ----------------
"BROKER_REDUCE(limit:300)" "0" "-1"
"COMBINE_GROUPBY_ORDERBY" "1" "0"
"AGGREGATE_GROUPBY_ORDERBY" "2" "1"
"TRANSFORM(fields.column1, date)" "3" "2"
"PROJECT(fields.column1, date, distinctCountHLL__hllState)" "4" "3"
"FILTER_STARTREE_INDEX" "5" "4"
I'm using pinot 0.10.0 and I have the star tree index enabled on date, fields.column1, fields.column2
on distinctcounthll__hllState
and the inverted index enabled on fields.column2
.
Just as a reference the same query without the filtering takes 277ms with numEntriesScannedInFilter:0
and numEntriesScannedPostFilter:54480
.
My question is, how can I further optimise filtering when grouping by and using a star-tree index? Can the star tree index be used in conjunction with the inverted index?
Thanks a lotXiaobing
06/24/2022, 5:42 PMfields.column2
in startree index as the first split dimension. From the doc: “All columns in the filter and group-by clause of a query should be included in this list in order to use the star-tree index.”Tommaso Peresson
06/24/2022, 7:32 PMTommaso Peresson
06/24/2022, 7:33 PMXiaobing
06/24/2022, 8:15 PMstar tree index enabled on date, fields.column1I meant to config startree index like
"dimensionsSplitOrder": [ "fields.column2", "date", "fields.column1" ], // in this order
Tommaso Peresson
06/24/2022, 8:16 PMfields.column2
tooTommaso Peresson
06/24/2022, 8:17 PMXiaobing
06/24/2022, 8:21 PM