Currently `DistinctCountHLL` only works for single...
# pinot-dev
k
Currently
DistinctCountHLL
only works for single value fields. It seems like a simple change in
DistinctCountHLLAggregationFunction.aggregate()
to check if the
BlockValSet
is multi-valued, and if so then call
BlockValSet.getXXXMV()
and do a sub-iteration on the secondary array it returns. Does that make sense?
k
Surprised that’s it’s not supported as of now
k
If you run this query on a MVF, you get:
Copy code
"message": "QueryExecutionError:\njava.lang.UnsupportedOperationException\n\tat org.apache.pinot.core.segment.index.readers.ForwardIndexReader.readDictIds(ForwardIndexReader.java:84)\n\tat org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readStringValues(DataFetcher.java:439)\n\tat org.apache.pinot.core.common.DataFetcher.fetchStringValues(DataFetcher.java:146)\n\tat org.apache.pinot.core.common.DataBlockCache.getStringValuesForSVColumn(DataBlockCache.java:194)\n\tat org.apache.pinot.core.operator.docvalsets.ProjectionBlockValSet.getStringValuesSV(ProjectionBlockValSet.java:94)\n\tat org.apache.pinot.core.query.aggregation.function.DistinctCountHLLAggregationFunction.aggregate(DistinctCountHLLAggregationFunction.java:103)\n\tat org.apache.pinot.core.query.aggregation.DefaultAggregationExecutor.aggregate(DefaultAggregationExecutor.java:47)\n\tat org.apache.pinot.core.operator.query.AggregationOperator.getNextBlock(AggregationOperator.java:66)\n\tat org.apache.pinot.core.operator.query.AggregationOperator.getNextBlock(AggregationOperator.java:35)\n\tat org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:49)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator$1.runJob(BaseCombineOperator.java:94)\n\tat org.apache.pinot.core.util.trace.TraceRunnable.run(TraceRunnable.java:40)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)"
I’ll file an issue and generate a PR
m
@Ken Krugler can you try
distinctCountHLLMV
?
Aggregation functions on MV columns have an
MV
suffix in the name.
k
@Mayank Thanks for clarifying, I was confused by seeing
aggregate
,
aggregateGroupBySV
, and
aggregateGroupByMV
. Made me think there was a missing
aggregateMV
function. I see now that the
BySV
and ByMV` methods are for doing aggregations when the grouping column is SV vs. MV.
m
👍
k
@Mayank But why does there need to be a different function? In the implementations the function signatures are the same, and (I assume) the
BlockValSet
could be used to determine whether to handle it as an SV or an MV column.
m
Yeah, in future, we might merge the two.
k
OK, I’ll change my issue description 🙂
m
sounds good