Hi everybody I have a question for you I m trying to figure Apache Pinot #troubleshooting

Hi everybody, I have a question for you. I'm tryin...

Tommaso Peresson

06/24/2022, 9:47 AM

Hi everybody, I have a question for you. I'm trying to figure out the best combination of filters for queries like:

Copy code

select date, 
         fields.column1, 
         distinctcounthll(hllState)
  from EventsHll 
  where fields.column2 in (1,2,10)
  group by date, fields.column1 
  limit 300

And I get sub optimal performance:

Copy code

"numServersQueried": 2,
  "numServersResponded": 2,
  "numSegmentsQueried": 1816,
  "numSegmentsProcessed": 1816,
  "numSegmentsMatched": 1816,
  "numConsumingSegmentsQueried": 0,
  "numDocsScanned": 154861922,
  "numEntriesScannedInFilter": 214829377,
  "numEntriesScannedPostFilter": 464585766,
  "numGroupsLimitReached": false,
  "totalDocs": 447509450,
  "timeUsedMs": 25832,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0,
  "numRowsResultSet": 140

So from my understanding the best combination would be a star-tree index for the aggregation and an inverted index for the filtering. Now when I look at the query explanation it seems that it uses the star-tree index filtering

Copy code

Operator#$%0                                                       Operator_Id#$%1    Parent_Id#$%2   
 ------------------------------------------------------------------ ------------------ ---------------- 
  "BROKER_REDUCE(limit:300)"                                         "0"                "-1"            
  "COMBINE_GROUPBY_ORDERBY"                                          "1"                "0"             
  "AGGREGATE_GROUPBY_ORDERBY"                                        "2"                "1"             
  "TRANSFORM(fields.column1, date)"                                  "3"                "2"             
  "PROJECT(fields.column1, date, distinctCountHLL__hllState)"        "4"                "3"             
  "FILTER_STARTREE_INDEX"                                            "5"                "4"

I'm using pinot 0.10.0 and I have the star tree index enabled on

date, fields.column1, fields.column2

distinctcounthll__hllState

and the inverted index enabled on

fields.column2

. Just as a reference the same query without the filtering takes 277ms with

numEntriesScannedInFilter:0

and

numEntriesScannedPostFilter:54480

. My question is, how can I further optimise filtering when grouping by and using a star-tree index? Can the star tree index be used in conjunction with the inverted index? Thanks a lot

Xiaobing

06/24/2022, 5:42 PM

I wonder if it helps to also add

fields.column2

in startree index as the first split dimension. From the doc: “All columns in the filter and group-by clause of a query should be included in this list in order to use the star-tree index.”

Tommaso Peresson

06/24/2022, 7:32 PM

it already is

Tommaso Peresson

06/24/2022, 7:33 PM

also cardinality in neither of the columns is very high

Xiaobing

06/24/2022, 8:15 PM

hmm.. I saw you said about

star tree index enabled on date, fields.column1

I meant to config startree index like

Copy code

"dimensionsSplitOrder": [ "fields.column2", "date", "fields.column1" ], // in this order

Tommaso Peresson

06/24/2022, 8:16 PM

My bad it's configured for

fields.column2

too

👌 1

Tommaso Peresson

06/24/2022, 8:17 PM

Which considerations should I make in the ordering of the dimensions in the star tree index

Xiaobing

06/24/2022, 8:21 PM

per how it’s traversed during query execution (doc), I’d say to put columns in predicates closer to root

Open in Slack

Previous Next