Ashish
03/18/2022, 4:40 PM
Mayank
Richard Startin
03/21/2022, 11:30 AM
ProjectionBlockValSet.getDictionaryIds*V, which indirectly uses `DataFetcher` via the `DataBlockCache`) and compute `groupId`s from their combinations. This step produces an array with a `groupId` for each docId (encoded by the position in the array). A side effect of this is knowing how many distinct `groupId`s exist for each block. These values are stored in a `GroupKeyGenerator`.
2. Segment-local aggregation by `groupId` - this is safe to do because the `dictId`s the `groupId`s derive from are defined within the scope of a segment. The aggregation function is applied to the value at each docId and accumulates into the partial aggregate for that docId's `groupId` whenever a `groupId` is repeated. The results are stored in a `GroupByResultHolder`.
3. Server-level aggregation - `dictId`s (and therefore `groupId`s) are not well defined beyond the scope of a segment, so they can't be used across segments. The `GroupKeyGenerator`s (which know how to map the `groupId`s to the actual group keys) and the `GroupByResultHolder`s (which map `groupId`s to partial aggregates) are combined in an `AggregationGroupByResult`, which is then post-processed: the `groupId`s are translated to group keys (this is where `Dictionary.getInternal` is called), and then the partial aggregates are aggregated and upserted into an `IndexedTable`.
Richard Startin
03/21/2022, 11:37 AM
`DataFetcher` retrieves blocks of `dictId`s and this does happen earlier in the evaluation when there is a dictionary present, but the `DataFetcher` can't produce group keys valid across segments
• "This does not go through the datafetcher and hence causes increased latency." - this is the kind of statement I would only make having done an experiment with some numbers in hand.
• "Is there a way around this?" - seek solutions once it's confirmed there is a problem. 😄
Ashish
03/21/2022, 2:21 PM
Richard Startin
03/21/2022, 2:25 PM
getInternal is fast, but that building group keys is completely unrelated to the purpose of the `DataFetcher`
Richard Startin
03/21/2022, 2:29 PM
Richard Startin
03/21/2022, 2:31 PM
Ashish
03/21/2022, 2:35 PM
Richard Startin
03/21/2022, 2:40 PM
Richard Startin
03/21/2022, 2:42 PM
`DataFetcher` retrieves blocks of values indexed by docId, not by `groupId`
Richard Startin
03/21/2022, 2:52 PM
| dim1 | dim2 | metric |
|-------|-------|--------|
| "a1" | "b1" | 10 |
| "a1" | "b2" | 11 |
| "a2" | "b1" | 12 |
| "a2" | "b4" | 9 |
| "a1" | "b1" | 8 |
then a call to `DataFetcher.getStringValuesSV(dim1)` would retrieve
"a1", "a1", "a2", "a2", "a1"
and `DataFetcher.getStringValuesSV(dim2)` would retrieve
"b1", "b2", "b1", "b4", "b1"
but there are only 4 groups:
("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b4")
the calls to `Dictionary.getInternal` take place when mapping dim1:0 to "a1", dim1:1 to "a2", dim2:0 to "b1", dim2:1 to "b2", and dim2:3 to "b4" (no group has "b3"); basically, it's a completely different operation from extracting the values for each docId.
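The example above can be sketched in a few lines of Python. This is only an illustration of the idea, not Pinot code: the names `rows`, `group_id_of`, and `lookup` are hypothetical, SUM is an assumed aggregation, `groupId`s are assumed to be assigned in order of first appearance, and the dim2 dictionary is assumed to contain "b3" so that the dictionary ids match the ones quoted above.

```python
# Hypothetical stand-in for one segment; not Pinot's actual data structures.
rows = [
    ("a1", "b1", 10),
    ("a1", "b2", 11),
    ("a2", "b1", 12),
    ("a2", "b4", 9),
    ("a1", "b1", 8),
]

# Assumed sorted dictionaries. "b3" is included only so that dim2:3 maps to "b4",
# matching the ids used in the message.
dim1_dict = ["a1", "a2"]
dim2_dict = ["b1", "b2", "b3", "b4"]

# Forward index: one dictId per docId (docId is the position in the list).
dim1_ids = [dim1_dict.index(d1) for d1, _, _ in rows]
dim2_ids = [dim2_dict.index(d2) for _, d2, _ in rows]

# What a docId-indexed fetch yields: one value per docId, repeats included.
dim1_values = [dim1_dict[i] for i in dim1_ids]
dim2_values = [dim2_dict[i] for i in dim2_ids]

# Step 1: derive a groupId per docId from the combination of dictIds.
group_id_of = {}  # (dim1 dictId, dim2 dictId) -> groupId
group_ids = []
for combo in zip(dim1_ids, dim2_ids):
    group_ids.append(group_id_of.setdefault(combo, len(group_id_of)))

# Step 2: segment-local aggregation (SUM assumed) keyed by groupId.
sums = [0] * len(group_id_of)
for group_id, (_, _, metric) in zip(group_ids, rows):
    sums[group_id] += metric

# Step 3: translate groupIds to group keys. The dictionary lookups happen here,
# once per group and dimension, not once per docId.
lookups = []

def lookup(column, dictionary, dict_id):
    lookups.append((column, dict_id))
    return dictionary[dict_id]

result = {}
for (d1, d2), group_id in group_id_of.items():
    key = (lookup("dim1", dim1_dict, d1), lookup("dim2", dim2_dict, d2))
    result[key] = sums[group_id]

print(group_ids)
print(result)
```

The docId-indexed lists have five entries each, while the dictionary lookups are driven by the four groups, which is why the two operations are unrelated.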
One thing that could be improved: if dim1 has cardinality x and dim2 has cardinality y, you will extract each dictionary value from dim1 up to y times and each from dim2 up to x times, so if the cardinalities are low enough it makes sense to cache the raw values indexed by dictionary id (not docId).
Ashish
03/21/2022, 6:36 PM
Richard Startin
03/21/2022, 7:22 PM
Richard Startin
03/21/2022, 7:22 PM
Ashish
03/23/2022, 2:24 PM