# troubleshooting
p
reposting from #C019DKYBC6P as advised by @Kishore G (thank you): https://apache-pinot.slack.com/archives/C019DKYBC6P/p1661795736703099
k
(screenshot attached: Screen Shot 2022-08-29 at 3.37.59 PM.png)
p
Yes, I noticed that distinct count HLL is supported, and it is one of the two main things we need. My question is about the second main thing: a quantiles sketch like the Quantiles Double Sketch that Druid uses to get approximate quantiles (https://druid.apache.org/docs/latest/development/extensions-core/datasketches-quantiles.html). It is part of Apache DataSketches (https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html) - any sketch for float, integer, and categorical (string) types would do, as long as it has configurable accuracy and fast merging of sketches across segments and dimensions.
k
What is the quantile double sketch used for? I think the UDFs take some parameters to control the accuracy as well
All of these are simple UDFs that can be added easily, since we already use DataSketches
You should also use the star-tree index
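For context, a star-tree index is declared in the Pinot table config. A sketch of what that might look like, with hypothetical column names (`country`, `deviceType`, `latencyMillis`, `deviceId`) and aggregation pairs chosen to match the use cases in this thread; field names follow the Pinot star-tree config, but treat the exact values as illustrative, not a tested config:

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "deviceType"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "PERCENTILE_TDIGEST__latencyMillis",
          "DISTINCT_COUNT_HLL__deviceId"
        ],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

The star-tree pre-aggregates sketches along the dimension split order, so percentile and distinct-count queries over dimension subsets avoid scanning raw rows.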
p
A quantiles sketch (or other sketches for quantile approximation, histograms, and most-frequent elements) like REQ, KLL, or the classic Quantiles Sketch (which Druid supports) is used to solve problems like these made-up ones:
• You have a metric like the latency of some events on devices. The latency comes from millions of devices, events happen 1-1000 times per hour, and there are 10-1000 dimensions. The task is a dashboard that shows any user-defined quantile chart (p50, p90, p95, ...) for a subset of dimension values over the past day/week/month with hour/day granularity. The set of quantiles is not pre-defined. Some users want to see histograms of the latencies instead of quantiles.
• Another use case is a categorical value instead of a numeric one. Say the made-up categorical is the error message generated on a device, and the goal is to get approximate frequencies of the top N most common error messages over an hour/day/week/month, where N is also user-controlled and there are again 10-1000 dimensions for analysis. Error messages change frequently - new ones are added, old ones removed, while some persist - so there is no way to build a dimension out of them.
• In both made-up problems the data can be streaming (Kafka) or sometimes a regular batch upload, for example from Spark.
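The key property all of these sketches share is mergeability: each segment builds a small summary, and summaries are combined across segments and dimensions at query time. A toy illustration of that idea in plain Python, using bottom-k random sampling (not KLL or the DataSketches algorithm, just the simplest mergeable quantile summary; `SampleSketch` and the column values are made up for this example):

```python
import random

class SampleSketch:
    """Toy mergeable quantile sketch: keep the k items with the
    smallest random tags (bottom-k sampling). The survivors form a
    uniform sample of everything ever seen, so sample quantiles
    approximate true quantiles; accuracy grows with k."""

    def __init__(self, k=256):
        self.k = k
        self.items = []  # list of (random tag, value)

    def update(self, value):
        self.items.append((random.random(), value))
        self._trim()

    def merge(self, other):
        # Merging two segments' sketches is just union + trim,
        # which is what lets the broker combine per-segment
        # summaries at query time.
        out = SampleSketch(self.k)
        out.items = self.items + other.items
        out._trim()
        return out

    def _trim(self):
        if len(self.items) > self.k:
            self.items.sort()
            del self.items[self.k:]

    def quantile(self, q):
        vals = sorted(v for _, v in self.items)
        idx = min(int(q * len(vals)), len(vals) - 1)
        return vals[idx]

# One sketch per "segment", merged for the query.
random.seed(42)
seg1, seg2 = SampleSketch(), SampleSketch()
for _ in range(10_000):
    seg1.update(random.gauss(100, 15))  # e.g. latencies, device group A
    seg2.update(random.gauss(200, 15))  # device group B
merged = seg1.merge(seg2)
p50 = merged.quantile(0.5)  # any quantile can be asked after the fact
```

Real sketches (KLL, t-digest, Quantiles Sketch) get much better accuracy per byte than uniform sampling, but the merge-then-query shape is the same.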
k
We have Quantile Digest (q-digest) and T-Digest for percentiles
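For reference, Pinot exposes these as aggregation functions in SQL. A hedged example of what a dashboard query might look like (the table and columns `events`, `country`, `latencyMillis`, `deviceId`, `eventTime` are hypothetical; `PERCENTILETDIGEST` and `DISTINCTCOUNTHLL` are the Pinot function names):

```sql
-- Approximate p95 latency and distinct devices per country;
-- 1661700000000 stands in for "start of the time window" in epoch millis.
SELECT country,
       PERCENTILETDIGEST(latencyMillis, 95) AS p95Latency,
       DISTINCTCOUNTHLL(deviceId) AS approxDevices
FROM events
WHERE eventTime > 1661700000000
GROUP BY country
```

Because the percentile argument is part of the query, any quantile (p50, p90, p99, ...) can be requested without pre-defining it at ingestion time, which matches the dashboard use case above.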