# general

Lee Rhodes

01/26/2021, 8:47 PM

Hello, this is Lee Rhodes of Apache DataSketches. I would be interested in some feedback on how you are using our library: which sketches you are using, what you feel works well, and constructive feedback on what could work better or what problems you would like sketches to address.

❤️ 1

Mayank

01/26/2021, 9:29 PM

Hello **@Lee Rhodes**. One of our count-distinct functions has a theta-sketch-based implementation (we have an HLL-based function as well). We like theta sketches for their ability to perform set operations. However, our biggest challenge is getting accuracy under control, especially when intersecting sets of very different sizes. Anything you guys can do for that would be really helpful.

Lee Rhodes

01/26/2021, 10:48 PM

Yes, accurate distinct counts of intersections (and differences) of sampled sets are difficult. And it is not a shortcoming of the algorithm, per se. It can be proven mathematically that no matter what streaming algorithm you use, if you end up with sampled sets of the original domain, the accuracy of your intersection estimate can be very poor compared with the accuracy of a union operation.
We knew this from the outset. This is why we provide the getUpperBound() and getLowerBound() methods, which you can use as a tool to warn you, after the fact, if the error goes beyond what you consider to be acceptable.
For example, with a theta sketch configured with K=4096 (logK=12), its accuracy on a single stream or from a merge (union) will asymptote to about +/- 3.1% with 95% confidence: 2 / sqrt(4096) = 0.03125.
What you can do: after any intersection or difference operation, check how much the expected error has changed by computing `((getUpperBound(2) / getEstimate()) - 1) * sqrt(K) / 2`. This is the factor by which your intersection error exceeds the nominal RSE of the sketch. If this results in a 2, that means your estimated error for that operation will be about twice as large or, in this case, about +/- 6.25% (at 95% confidence).
At least this allows you to monitor the relative error of intersections and even to determine which operations caused the largest increase in error.
You can also try scheduling the sequence of your set operations so that all of your intersections occur either early in the sequence or at the end. Depending on your data, you might find that reordering the sequence helps.
Other than that, know that the intersection error of the theta sketches approaches the theoretical limit of what is possible, given a streaming algorithm and limited space.
I hope this helps.
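
As a rough illustration of the check described in this message, here is a minimal Python sketch of the arithmetic. It is not DataSketches code: the function names `nominal_error_95` and `error_inflation_factor` are hypothetical, and the `estimate` and `upper_bound_2sigma` arguments are assumed to be values you have already read from getEstimate() and getUpperBound(2) on the result of an intersection or difference.

```python
import math


def nominal_error_95(k: int) -> float:
    """Nominal relative error at about 95% confidence: 2 / sqrt(K)."""
    return 2.0 / math.sqrt(k)


def error_inflation_factor(estimate: float, upper_bound_2sigma: float, k: int) -> float:
    """Factor by which the observed error exceeds the nominal error.

    Implements ((upper_bound / estimate) - 1) * sqrt(K) / 2 from the message.
    Both inputs are assumed to come from the sketch produced by the set
    operation; this helper only does the arithmetic.
    """
    if estimate <= 0:
        raise ValueError("estimate must be positive to form a relative error")
    observed_relative_error = upper_bound_2sigma / estimate - 1.0
    return observed_relative_error * math.sqrt(k) / 2.0


# Made-up numbers for illustration: K=4096, an intersection estimate of
# 10,000, and a 2-sigma upper bound of 10,625 (6.25% above the estimate).
k = 4096
factor = error_inflation_factor(10_000.0, 10_625.0, k)
print(f"nominal error: {nominal_error_95(k):.5f}, inflation factor: {factor:.2f}")
```

A factor near 1 means the operation kept roughly the nominal accuracy of the sketch. Logging this factor after each set operation is one way to see which step in a sequence caused the largest increase in error.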

Mayank

01/26/2021, 10:49 PM

Yes, this is helpful. Thanks

Lee Rhodes

01/26/2021, 10:52 PM

Do you have any interest in some of the other sketches: quantiles, frequent items, etc.?

Mayank

01/26/2021, 11:05 PM

The interest is usually generated by Pinot users. Once we see our users asking for these, we are happy to add those into Pinot.

Lee Rhodes

01/26/2021, 11:12 PM

We have found that very few system users are even aware that these capabilities exist. We would be glad to work with you to promote the use of our DataSketches library to your users. There are lots of ways to do this.

👍 1

Ken Krugler

01/26/2021, 11:27 PM

Hi **@Lee Rhodes** I haven’t spent any time seriously thinking about this, but I always wondered if there was a faster way to approximate LLR (log-likelihood ratio) using sketch-like methods (other than just using sketches for approximate counts). I’ve found LLR to be a very useful way to surface outliers in a dataset, but doing the exact computation (say, via map-reduce) can be painful.

Mayank

01/26/2021, 11:29 PM

We recently solved an audience-matching use case at LinkedIn using the DataSketches implementation in Pinot. We talked about it in one of our meetups, and I am in the process of publishing a LinkedIn blog post on the same.

Happy to collaborate

Lee Rhodes

01/26/2021, 11:30 PM

I’ll have to do some research on LLR. Nonetheless, we have used both Frequent Items and Quantiles for finding outliers as well.

We would be glad to help you with your blog and/or meetups with materials and tutorials. Let us know how we can help.

👍 1

Karin Wolok

01/27/2021, 1:05 AM

And if you do anything like that, keep us posted - we'd be happy to cross publish / promote 🙂

Lee Rhodes

01/27/2021, 1:08 AM

We are actually preparing a press release with ASF about our recent graduation. It would be great if you folks could give us a couple of sentences on how useful DataSketches has been for Pinot!

Something with the format: “QUOTE,” said NAME, TITLE at COMPANY. “…MORE…”

Ken Krugler

01/27/2021, 3:12 PM

Lee Rhodes

01/27/2021, 5:51 PM

Karin Wolok

01/28/2021, 1:03 AM

Hi **@Lee Rhodes**! I did now! haha. Threads on Slack don't notify you, so it's difficult to keep track. The community is still kind of young and I can't really pinpoint exactly who would be a good person to get a quote from because I don't know what everyone has in their tool kit. I would suggest posting in this channel to identify them. I also asked a couple of colleagues I know to see if they know anything about it. Waiting on an answer.

Ken Krugler

01/28/2021, 1:08 AM

Karin Wolok

01/28/2021, 1:12 AM

Hi **@Ken Krugler** I did see that! 😃
I think **@Lee Rhodes** is looking to identify more people, if I am not mistaken? I am still such a newb myself and still getting to know people, so slightly unhelpful here. I do not know anyone off the top of my head (wouldn't have known **@Mayank** if he didn't speak up). I would imagine the general Slack channel might be the best place to find those folks. I did ask some people who know the community well, though, to see if they have any ideas.

Always a fun opportunity to get quoted in media talking about cool tech. haha

Lee Rhodes

01/28/2021, 4:48 AM

A quote from **@Mayank** would be great, or even from Subbu. He and I were together at Yahoo!

Mayank

01/28/2021, 4:48 AM

I'll DM you **@Lee Rhodes**

Lee Rhodes

02/02/2021, 6:52 PM

Cheers,
Lee.

Ken Krugler

02/02/2021, 7:03 PM

Hi **@Lee Rhodes** thanks for the update. There is the assumption of binomial distribution for LLR to be meaningful, good point. Also thanks for the “concomitant guarantees” phrase, I plan to drop that into a Keynote presentation sometime soon :)

Lee Rhodes

02/02/2021, 7:04 PM

🙂