#general

Mark.Tang

01/06/2021, 2:13 AM
Hi Team, I have seen that in 0.4.0, Pinot implemented the initial version of a theta-sketch based distinct count aggregation function, using the Apache DataSketches library. Druid's latest release also includes a DataSketches extension (Theta sketch, Tuple sketch, Quantiles sketch, HLL sketch). Does Pinot have any plan to implement sketches other than the Theta sketch? Thanks.

Mayank

01/06/2021, 2:15 AM
Pinot already supports HLL-based distinct count and TDigest-based percentiles. If there's a specific case where you would find DataSketches-based implementations more useful, we can definitely explore that. If so, I would recommend filing an issue for it.
👍 2
For HLL we use com.clearspring.analytics.stream.cardinality.HyperLogLog
🙌 1
And for TDigest, we use com.tdunning.math.stats.TDigest
🙌 1
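For readers new to HLL, the idea behind an HLL-style distinct count can be sketched in a few lines of Python. This is a toy for illustration only, not the clearspring class Pinot uses: the class name TinyHLL, the SHA-1 based hashing, and the default of p=12 index bits are all assumptions made for this example.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog (hypothetical name); illustrative, not Pinot's code."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers; the alpha formula used in
        # cardinality() assumes m >= 128, i.e. p >= 7.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def _hash64(self, value):
        # Assumption: any well-mixed 64-bit hash works; SHA-1 is used here
        # only because it ships with the standard library.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        width = 64 - self.p
        idx = h >> width                    # top p bits select a register
        rest = h & ((1 << width) - 1)       # remaining bits
        # rank = 1-based position of the leftmost 1-bit within `rest`
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches built with the same p merge by taking register-wise maxima.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different p")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

Adding the same value twice leaves the registers unchanged, which is why duplicates do not inflate the estimate, and merging two sketches gives exactly the registers of a sketch built over the union of their inputs.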

Mark.Tang

01/06/2021, 2:25 AM
Thanks for the quick reply!

Mayank

01/06/2021, 2:26 AM
👍

Mark.Tang

01/06/2021, 2:57 AM
@Mayank we may need to pay attention to KLL sketch vs t-digest (the Pinot implementation); see the following comparison by DataSketches: https://datasketches.apache.org/docs/Quantiles/KllSketchVsTDigest.html

Mayank

01/06/2021, 3:05 AM
Thanks for sharing @Mark.Tang. We can definitely explore adding these if needed.
Also noting that DataSketches includes the newer CPC sketch, which estimates stream cardinalities more efficiently than the famous HLL sketch. It comes from https://arxiv.org/pdf/1708.06839.pdf

Mayank

01/06/2021, 3:19 PM
If you could open an issue and add all this there, it would help us track this request @Mark.Tang

Mark.Tang

01/07/2021, 1:35 AM
I will try to open an issue to discuss the sketches family @Mayank

Mayank

01/07/2021, 1:35 AM
Thanks @Mark.Tang.

Mark.Tang

01/07/2021, 7:21 AM