Ashish
03/18/2022, 4:40 PM
Mayank
Richard Startin
03/21/2022, 11:30 AM
`ProjectionBlockValSet.getDictionaryIds*V`, which indirectly uses `DataFetcher` via the `DataBlockCache`) and compute `groupId`s from their combinations. This step produces an array with a `groupId` for each `docId` (encoded by the position in the array). A side effect of this is knowing how many distinct `groupId`s exist for each block. These values are stored in a `GroupKeyGenerator`.
2. Segment-local aggregation by `groupId` - this is safe to do because the `dictId`s the `groupId`s derive from are defined within the scope of a segment. An aggregation function is invoked for the aggregated value at each `docId` whenever the `groupId` at that `docId` is repeated. The results are stored in a `GroupByResultHolder`.
3. Server-level aggregation - `dictId`s (and therefore `groupId`s) are not well defined beyond the scope of a segment, so they can't be used outside it. The `GroupKeyGenerator`s (which know how to map the `groupId`s to the actual group keys) and the `GroupByResultHolder`s (which map `groupId`s to partial aggregates) are combined in an `AggregationGroupByResult`, which is then post-processed: the `groupId`s are translated to group keys (this is where `Dictionary.getInternal` is called), and then the partial aggregates are aggregated and upserted into an `IndexedTable`.
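To make the three steps concrete, here is a rough sketch. Every name in it is made up for illustration and none of it is Pinot's actual code: the toy `Segment` stands in for a segment with two dictionary-encoded dimensions, the `groupId` encoding (a mixed-radix combination of the two `dictId`s) is just one simple assumption, and a plain `HashMap` stands in for the `IndexedTable`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every name here is hypothetical, not one of Pinot's classes.
class GroupBySketch {

  /** A toy segment with two dictionary-encoded dimensions and one metric. */
  static final class Segment {
    final String[] dict1;  // dictId -> value for dim1 (segment-local)
    final String[] dict2;  // dictId -> value for dim2 (segment-local)
    final int[] dictIds1;  // docId -> dictId for dim1
    final int[] dictIds2;  // docId -> dictId for dim2
    final long[] metric;   // docId -> metric value

    Segment(String[] dict1, String[] dict2, int[] dictIds1, int[] dictIds2, long[] metric) {
      this.dict1 = dict1;
      this.dict2 = dict2;
      this.dictIds1 = dictIds1;
      this.dictIds2 = dictIds2;
      this.metric = metric;
    }
  }

  /** Step 1: one groupId per docId, derived from the combination of dictIds. */
  static int[] groupIds(Segment s) {
    int[] groupIds = new int[s.metric.length];
    for (int docId = 0; docId < groupIds.length; docId++) {
      // Assumed encoding: treat the pair of dictIds as a two-digit mixed-radix number.
      groupIds[docId] = s.dictIds1[docId] * s.dict2.length + s.dictIds2[docId];
    }
    return groupIds;
  }

  /** Step 2: segment-local SUM, indexed by groupId. */
  static long[] sumByGroupId(Segment s, int[] groupIds) {
    long[] sums = new long[s.dict1.length * s.dict2.length];
    for (int docId = 0; docId < groupIds.length; docId++) {
      sums[groupIds[docId]] += s.metric[docId];
    }
    return sums;
  }

  /** Step 3: translate groupIds to group keys, then merge partial sums across segments. */
  static void mergeInto(Map<List<String>, Long> table, Segment s, int[] groupIds, long[] sums) {
    boolean[] merged = new boolean[sums.length];
    for (int groupId : groupIds) {
      if (merged[groupId]) {
        continue; // each distinct groupId is translated and merged once
      }
      merged[groupId] = true;
      // The dictionary lookups happen here, per distinct group rather than per docId.
      String v1 = s.dict1[groupId / s.dict2.length];
      String v2 = s.dict2[groupId % s.dict2.length];
      table.merge(Arrays.asList(v1, v2), sums[groupId], Long::sum);
    }
  }

  /** Runs all three steps over each segment and returns the merged result. */
  static Map<List<String>, Long> aggregate(Segment... segments) {
    Map<List<String>, Long> table = new HashMap<>();
    for (Segment s : segments) {
      int[] groupIds = groupIds(s);
      mergeInto(table, s, groupIds, sumByGroupId(s, groupIds));
    }
    return table;
  }
}
```

Calling `GroupBySketch.aggregate(segmentA, segmentB)` with two segments whose dictionaries assign different ids to the same value (say "a1" is `dictId` 0 in one and 1 in the other) still merges correctly, because the merge is keyed on the decoded values and never on the segment-local ids.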
Richard Startin
03/21/2022, 11:37 AM
`DataFetcher` retrieves blocks of `dictId`s and this does happen earlier in the evaluation when there is a dictionary present, but the `DataFetcher` can't produce group keys valid across segments
• "This does not go through the datafetcher and hence causes increased latency." - this is the kind of statement I would only make having done an experiment with some numbers in hand.
• "Is there a way around this?" - seek solutions once it's confirmed there is a problem. 😄
Ashish
03/21/2022, 2:21 PM
Richard Startin
03/21/2022, 2:25 PM
`getInternal` is fast, but that building group keys is completely unrelated to the purpose of the `DataFetcher`
Richard Startin
03/21/2022, 2:29 PM
Richard Startin
03/21/2022, 2:31 PM
Ashish
03/21/2022, 2:35 PM
Richard Startin
03/21/2022, 2:40 PM
Richard Startin
03/21/2022, 2:42 PM
`DataFetcher` retrieves blocks of values indexed by `docId`, not by `groupId`
Richard Startin
03/21/2022, 2:52 PM
| dim1 | dim2 | metric |
|-------|-------|--------|
| "a1" | "b1" | 10 |
| "a1" | "b2" | 11 |
| "a2" | "b1" | 12 |
| "a2" | "b4" | 9 |
| "a1" | "b1" | 8 |
then a call to `DataFetcher.getStringValuesSV(dim1)` would retrieve
"a1", "a1", "a2", "a2", "a1"
and `DataFetcher.getStringValuesSV(dim2)` would retrieve
"b1", "b2", "b1", "b4", "b1"
but there are only 4 groups:
("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b4")
The calls to `Dictionary.getInternal` take place when mapping dim1:0 to "a1", dim1:1 to "a2", dim2:0 to "b1", dim2:1 to "b2", dim2:3 to "b4" (no group has "b3"). Basically, it's a completely different operation to extracting the values for each `docId`.
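The same five rows can be walked through in a small sketch, assuming dim1's dictionary is ["a1", "a2"] and dim2's is ["b1", "b2", "b3", "b4"] as implied by the ids above. The class and method names are invented for illustration and are not Pinot's API. It contrasts the per-`docId` fetch with decoding only the distinct groups, and it counts how many times each dictionary id gets resolved while decoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: class and method names are hypothetical, not Pinot's API.
class GroupKeyLookupSketch {
  static final String[] DIM1_DICT = {"a1", "a2"};
  static final String[] DIM2_DICT = {"b1", "b2", "b3", "b4"};
  static final int[] DIM1_IDS = {0, 0, 1, 1, 0}; // docId -> dictId for dim1
  static final int[] DIM2_IDS = {0, 1, 0, 3, 0}; // docId -> dictId for dim2

  /** What a docId-indexed fetch returns: one value per document. */
  static String[] valuesByDocId(String[] dict, int[] dictIds) {
    String[] values = new String[dictIds.length];
    for (int docId = 0; docId < dictIds.length; docId++) {
      values[docId] = dict[dictIds[docId]];
    }
    return values;
  }

  /** Distinct (dim1 dictId, dim2 dictId) pairs, in first-seen order. */
  static List<List<Integer>> distinctGroups(int[] ids1, int[] ids2) {
    Set<List<Integer>> groups = new LinkedHashSet<>();
    for (int docId = 0; docId < ids1.length; docId++) {
      groups.add(Arrays.asList(ids1[docId], ids2[docId]));
    }
    return new ArrayList<>(groups);
  }

  /** Decodes each distinct group once, counting dictionary lookups per dictId. */
  static List<String> decodeGroupKeys(List<List<Integer>> groups, int[] lookups1, int[] lookups2) {
    List<String> keys = new ArrayList<>();
    for (List<Integer> group : groups) {
      int id1 = group.get(0);
      int id2 = group.get(1);
      lookups1[id1]++; // stands in for one dictionary lookup on dim1
      lookups2[id2]++; // stands in for one dictionary lookup on dim2
      keys.add("(" + DIM1_DICT[id1] + ", " + DIM2_DICT[id2] + ")");
    }
    return keys;
  }
}
```

With these inputs the per-`docId` fetch produces five values per column, while decoding touches four groups. The lookup counters also show the same dictionary id being resolved more than once whenever a dimension value appears in several groups.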
One thing that could be improved: if dim1 has cardinality x and dim2 has cardinality y, you will extract each dictionary value from dim1 up to y times, and each from dim2 up to x times, so if the cardinalities are low enough it makes sense to cache the raw values indexed by dictionary id (not `docId`)
Ashish
03/21/2022, 6:36 PM
Richard Startin
03/21/2022, 7:22 PM
Richard Startin
03/21/2022, 7:22 PM
Ashish
03/23/2022, 2:24 PM