Hi everybody, I have a question for you. I'm tryin...
# troubleshooting
t
Hi everybody, I have a question for you. I'm trying to figure out the best combination of filters for queries like:
Copy code
select date, 
         fields.column1, 
         distinctcounthll(hllState)
  from EventsHll 
  where fields.column2 in (1,2,10)
  group by date, fields.column1 
  limit 300
And I get sub optimal performance:
Copy code
"numServersQueried": 2,
  "numServersResponded": 2,
  "numSegmentsQueried": 1816,
  "numSegmentsProcessed": 1816,
  "numSegmentsMatched": 1816,
  "numConsumingSegmentsQueried": 0,
  "numDocsScanned": 154861922,
  "numEntriesScannedInFilter": 214829377,
  "numEntriesScannedPostFilter": 464585766,
  "numGroupsLimitReached": false,
  "totalDocs": 447509450,
  "timeUsedMs": 25832,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0,
  "numRowsResultSet": 140
So from my understanding the best combination would be a star-tree index for the aggregation and an inverted index for the filtering. Now when I look at the query explanation it seems that it uses the star-tree index filtering
Copy code
Operator#$%0                                                       Operator_Id#$%1    Parent_Id#$%2   
 ------------------------------------------------------------------ ------------------ ---------------- 
  "BROKER_REDUCE(limit:300)"                                         "0"                "-1"            
  "COMBINE_GROUPBY_ORDERBY"                                          "1"                "0"             
  "AGGREGATE_GROUPBY_ORDERBY"                                        "2"                "1"             
  "TRANSFORM(fields.column1, date)"                                  "3"                "2"             
  "PROJECT(fields.column1, date, distinctCountHLL__hllState)"        "4"                "3"             
  "FILTER_STARTREE_INDEX"                                            "5"                "4"
I'm using pinot 0.10.0 and I have the star tree index enabled on
date, fields.column1, fields.column2
on
distinctcounthll__hllState
and the inverted index enabled on
fields.column2
. Just as a reference the same query without the filtering takes 277ms with
numEntriesScannedInFilter:0
and
numEntriesScannedPostFilter:54480
. My question is, how can I further optimise filtering when grouping by and using a star-tree index? Can the star tree index be used in conjunction with the inverted index? Thanks a lot
x
I wonder if it helps to also add
fields.column2
in startree index as the first split dimension. From the doc: “All columns in the filter and group-by clause of a query should be included in this list in order to use the star-tree index.”
t
it already is
also cardinality in neither of the columns is very high
x
hmm.. I saw you said about
star tree index enabled on date, fields.column1
I meant to config startree index like
Copy code
"dimensionsSplitOrder": [ "fields.column2", "date", "fields.column1" ], // in this order
t
My bad it's configured for
fields.column2
too
👌 1
Which considerations should I make in the ordering of the dimensions in the star tree index
x
per how it’s traversed during query execution (doc), I’d say to put columns in predicates closer to root