Hi in the GroupByKeyGenerator under some circumstances raw k Apache Pinot #pinot-dev

Hi, in the GroupByKeyGenerator, under some circums...

Ashish

03/18/2022, 4:40 PM

Hi, in the GroupByKeyGenerator, under some circumstances, raw keys are used (by invoking getInternal method on the dictionary) instead of dictId. This does not go through the datafetcher and hence causes increased latency. Is there a way around this?

Mayank

03/20/2022, 2:05 AM

@Richard Startin may have some ideas here

Richard Startin

03/21/2022, 11:30 AM

hi @Ashish - I'm not sure I understand your concern. Let me clarify how group bys are evaluated. Assuming there is a dictionary in the first place (best case), the group by implementation has roughly 3 phases: 1. read a block of `dictId`s from each of the grouping columns (this is done via

ProjectionBlockValSet.getDictionaryIds*V

, which indirectly uses

DataFetcher

via the

DataBlockCache

) and compute `groupId`s from their combinations, this step produces an array with a

groupId

for each

docId

(encoded by the position in the array). A side effect of this is knowing how many distinct `groupId`s exist for each block. These values are stored in a

GroupKeyGenerator

. 2. segment local aggregation by

groupId

- this is safe to do because the `dictId`s the `groupId`s derive from are defined within the scope of a segment. An aggregation function is invoked for the aggregated value at each

docId

whenever the

groupId

at that

docId

is repeated. The results are stored in a

GroupByResultHolder

. 3. Server level aggregation - `dictId`s (and therefore `groupId`s) are not well defined beyond the scope of a segment so they can't be used beyond the scope of a segment. The `GroupKeyGenerator`s (which know how to map the `groupId`s to the actual group keys) and the

GroupByResultHolder

s (which map

groupId

s to partial aggregates) are combined in an

AggregationGroupByResult

which is then post-processed, the `groupId`s are translated to

groupKeys

(this is where

Dictionary.getInternal

is called), and then the partial aggregates are aggregated and upserted into an

IndexedTable

Richard Startin

03/21/2022, 11:37 AM

so I have a few comments, given the above: • "under some circumstances, raw keys are used (by invoking getInternal method on the dictionary) instead of dictId" - `dictId`s are always mapped to the corresponding raw values once segment boundaries are crossed, not just "under some circumstances". • "This does not go through the datafetcher" - the

DataFetcher

retrieves blocks of

dictId

s and this does happen earlier in the evaluation when there is a dictionary present, but the

DataFetcher

can't produce group keys valid across segments • "This does not go through the datafetcher and hence causes increased latency." - this is the kind of statement I would only make having done an experiment with some numbers in hand. • "Is there a way around this?" - seek solutions once it's confirmed there is a problem. 😄

Ashish

03/21/2022, 2:21 PM

Thanks for taking a look. Will attach the profile/stacktrace soon. The retrieval of dictId is ok - it's the call to getInternal that does not seem to go through the cache and hence we see the same latency when the same query is repeated. Usually, second run of the query runs faster but not in the case we observed

Richard Startin

03/21/2022, 2:25 PM

note that I'm not telling you that

getInternal

is fast, but that building group keys is completely unrelated to the purpose of the

DataFetcher

Richard Startin

03/21/2022, 2:29 PM

you can try this out, it might help https://github.com/apache/pinot/pull/8376

Richard Startin

03/21/2022, 2:31 PM

for 1D group bys (1 group by expression) there's no point in doing any caching because each extraction will only happen once

Ashish

03/21/2022, 2:35 PM

Will try it out. Not sure why retrieving dict values through getInternal shouldn't go through the data fetcher and cached - it would have done that if the same column was present in projection only. No?

Richard Startin

03/21/2022, 2:40 PM

I've explained https://apache-pinot.slack.com/archives/C013WKLT5T7/p1647862246215949?thread_ts=1647621618.896269&cid=C013WKLT5T7

Richard Startin

03/21/2022, 2:42 PM

the

DataFetcher

retrieves blocks of values indexed by docId not by groupId

Richard Startin

03/21/2022, 2:52 PM

more tangibly, if you have values

Copy code

| dim1  | dim2  | metric |
|-------|-------|--------|
| "a1"  |  "b1" | 10     | 
| "a1"  |  "b2" | 11     | 
| "a2"  |  "b1" | 12     | 
| "a2"  |  "b4" | 9      |
| "a1"  |  "b1" | 8      |

then a call to the

DataFetcher.getStringValuesSV(dim1)

would retrieve

Copy code

"a1", "a1", "a2", "a2", "a1"

and ``DataFetcher.getStringValuesSV(dim2)` would retrieve

Copy code

"b1", "b2", "b1", "b4", "b1"

but there are only 4 groups:

Copy code

("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2, "b4")

the calls to

Dictionary.getInternal

take place when mapping dim1:0 to "a1", dim1:1 to "a2", dim2:0 to "b1", dim2:1 to "b2", dim2:3 to "b4" (no group has "b3"), basically it's a completely different operation to extracting the values for each docId. One thing that could be improved is if dim1 has cardinality x, and dim2 has cardinality y, you will extract each dictionary value from dim1 up to y times, and each from dim2 up to x times, so if the cardinalities are low enough it makes sense to cache the raw values indexed by dictionary id (not docId)

Ashish

03/21/2022, 6:36 PM

Yes, that makes sense.

Richard Startin

03/21/2022, 7:22 PM

my measurements suggest that this change should result in up to 20% performance improvement, and up to 50% reduction in allocation when there is a large number of group by keys

Richard Startin

03/21/2022, 7:22 PM

so we will probably merge this, it would be good to know if you see any improvement and how much

Ashish

03/23/2022, 2:24 PM

Thanks will try it out

Open in Slack

Previous Next