Asking because if I add a where clause (that filte...
# pinot-perf-tuning
k
Asking because if I add a where clause (that filters out nothing) using a dimension NOT in my dimensionsSplitOrder list, I though that the star tree wouldn’t be used - and the query time is the same for the case without that where clause.
c
j
If it filters out nothing, star-tree can still be used. You may check the
numDocsScanned
to see if the query is scanning the pre-aggregated records
k
numDocsScanned
= 1B (out of 1.7B) for case where the filter matches a few records, 362M for case where the filter matches no records, and 362M for the case of no filtering. But time taken is almost identical in all cases (27 seconds, +/- 1 second).
j
Can you share the query and the star-tree config? Seems not much documents are pre-aggregated
Too many documents selected even with star-tree
k
Copy code
"starTreeIndexConfigs": [{
      "dimensionsSplitOrder": [
        "advertiser",
        "adHash",
        "network",
        "imageSize",
        "adType",
        "platform",
        "country",
        "crawlDays"
      ],
      
      "skipStarNodeCreationForDimensions": [
      ],
      
      "functionColumnPairs": [
        "SUM__adSpend",
        "SUM__impressions",
        "DistinctCountHLL__adHash",
        "DistinctCountHLL__crawlDays",
        "MIN__crawlDays",
        "MAX__crawlDays"
      ],
      
      "maxLeafRecords": 10000
    },
select adHash,advertiser,sum(adSpend) from crawldata group by adHash,advertiser order by sum(adSpend) desc limit 100
j
Does
adHash
and
advertiser
has very high cardinality?
k
Yes
j
Most of the query time should be spent on grouping these 2 columns
Star-tree won't help much for this query as not much documents can be pre-aggregated
Only records with the same
adHash
and
advertiser
can be pre-aggregated
k
There’s a one-to-many relationship from advertiser to adHash. So given an adHash, it’s always for one advertiser.
j
I see. Can you try
select adHash, sum(adSpend) from crawldata group by adHash order by sum(adSpend) desc limit 100
and see if it is faster?
k
it’s slower - timed out after 30 sec
Retrying with 50 sec
j
It should definitely be cheaper than grouping on 2 columns
k
I’m going to have to log onto servers and check logs, as now the initial query is also timing out. So feels like something is borked…
I did just run a distinct_count query on adHash, which probably blew memory somewhere 😞
As there are likely > 1B unique adHashes, out of 1.7B records.
j
Based on the
numDocsScanned
for star-tree queries, maybe ~300M unique adHashes 😉
k
I partition each month of data into 30 segments, by adHash. So likely there’s about 300M unique adHashes per segment…
Hi @Jackie just saw your response to the other user, where you said:
Another way is to enable the tracing for the query and see if it uses the 
StarTreeFilterOperator
I had enabled tracing, and I didn’t see this in the trace output. So that would indicate my star tree isn’t being used for some reason, yes?
j
Can you paste the tracing?
k
OK, my bad - with tracing on, and the right query, I do see StarTreeFilterOperators in the JSON, thanks. For one of the many pieces, operators with time > 1ms were:
Copy code
{\"0_10\":[
	{\"StarTreeFilterOperator Time\":21}
	{\"DocIdSetOperator Time\":69}
	{\"ProjectionOperator Time\":69}
	{\"TransformOperator Time\":69}
	{\"DocIdSetOperator Time\":17}
	{\"ProjectionOperator Time\":17}
	{\"TransformOperator Time\":17}
	{\"AggregationGroupByOrderByOperator Time\":97}]}
j
AggregationGroupByOrderByOperator
time includes the time for other operators because it is nested
k
But it’s not the sum of times, so curious what that actually means.
j
The operators are chained with some extra operations
Copy code
AggregationGroupByOrderByOperator
TransformOperator
ProjectionOperator
DocIdSetOperator
StarTreeFilterOperator
And each one should include the time for the previous layer + time for the extra operations
E.g. with these 2 entries:
Copy code
{\"TransformOperator Time\":17}
	{\"AggregationGroupByOrderByOperator Time\":97}]}
We know the engine spends 17ms running the transform operator + 80ms aggregating the records
k
So given my example, how would I know which operators were chained (and thus showing a sum)? Or is it just by increasing time, so StarTreeFilter is 21ms, then next DocIdSet is chained and thus 48ms, and no time for the next Projection or Transform, but the next DocIdSet is 17ms? Just trying to figure out how to interpret the numbers…