# general
l
Hello, this is Lee Rhodes of Apache DataSketches. I would be interested in some feedback as to how you are using our library … what sketches are you using, what you feel works well, and constructive feedback on what could work better, or what problems you would like sketches to address?
❤️ 1
m
Hello @Lee Rhodes. One of our count-distinct functions has a Theta-Sketch based implementation (we have an HLL based function as well). We like Theta sketches for their capability to perform set operations. However, our biggest challenge is getting accuracy under control (especially in the case of intersections of unevenly sized sets). Anything you guys can do for that would be really helpful.
l
Yes, accuracy of distinct counts of intersections (and differences) of sampled sets is difficult. And it is not a shortcoming of the algorithm, per se. It can be proven mathematically that, no matter what streaming algorithm you use, if you end up with sampled sets of the original domain, the accuracy of your estimate can be very poor compared with the accuracy of a union operation. We knew this from the outset. This is why we provide the getUpperBound() and getLowerBound() methods, which you can use as a tool to warn you, after the fact, when the accuracy goes beyond what you consider acceptable. For example, with a Theta sketch configured with K = 4096 (logK = 12), the accuracy on a single stream or from a merge (union) will asymptote to about +/- 3.1% with 95% confidence: 2 / sqrt(4096) = 0.03125. What you can do: after any intersection or difference operation, check how much the expected error has changed by computing
((getUpperBound(2) / getEstimate()) - 1) * sqrt(K) / 2
This will be a factor of how much your intersection error exceeds the nominal RSE of the sketch. If this results in a 2, that means the estimated error of that operation will be about twice as large, or in this case about +/- 6.25% (at 95% confidence). At least this allows you to monitor the relative error of intersections and even determine which operations caused the largest increase in error. You can also try scheduling the sequence of your set operations so that all of your intersections occur either early in the sequence or at the end. Depending on your data, you may find that reordering the sequence helps. Other than that, know that the intersection error of Theta sketches approaches the theoretical limit of what is possible with a streaming algorithm and limited space. I hope this helps.
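If it helps, here is a rough, untested Java sketch of that check against the org.apache.datasketches Theta API (the stream sizes and lgK are only illustrative, and the Intersection method names have varied slightly across library versions):
```java
import org.apache.datasketches.theta.Intersection;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.UpdateSketch;

public class IntersectionErrorCheck {
  public static void main(String[] args) {
    final int lgK = 12;           // K = 4096, so nominal 2-sigma RSE = 2 / sqrt(K) ~ 3.1%
    final double k = 1 << lgK;

    // Two overlapping streams of very different sizes (illustrative data only).
    UpdateSketch a = UpdateSketch.builder().setLogNominalEntries(lgK).build();
    UpdateSketch b = UpdateSketch.builder().setLogNominalEntries(lgK).build();
    for (long i = 0; i < 1_000_000L; i++) { a.update(i); }
    for (long i = 950_000L; i < 1_000_000L; i++) { b.update(i); }

    Intersection intersection = SetOperation.builder().buildIntersection();
    intersection.intersect(a);
    intersection.intersect(b);
    Sketch result = intersection.getResult();

    final double est = result.getEstimate();
    // Factor by which the intersection error exceeds the sketch's nominal RSE:
    final double factor = ((result.getUpperBound(2) / est) - 1.0) * Math.sqrt(k) / 2.0;
    System.out.printf("estimate = %.0f, error inflation factor = %.2f%n", est, factor);
  }
}
```
If the printed factor climbs well above 1, that particular intersection is the place to look when the final estimate seems off.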
m
Yes, this is helpful. Thanks
l
Do you have any interest in some of the other sketches: quantiles, frequent items, etc?
m
The interest is usually generated by Pinot users. Once we see our users asking for these, we are happy to add those into Pinot.
l
We have found that very few system users are even aware that these capabilities exist. We would be glad to work with you to help promote the DataSketches library to your users. There are lots of ways to do this.
👍 1
k
Hi @Lee Rhodes I haven’t spent any time seriously thinking about this, but I always wondered if there was a faster way to approximate LLR (log-likelihood ratio) using sketch-like methods (other than just using sketches for approximate counts). I’ve found LLR to be a very useful way to surface outliers in a dataset, but doing the exact computation (say, via map-reduce) can be painful.
m
We recently solved an audience-matching use case at LinkedIn using the DataSketches implementation in Pinot. We talked about it in one of our meetups, and I am in the process of publishing a LinkedIn blog on the same.
Happy to collaborate
l
I’ll have to do some research on LLR. Nonetheless, we have used both Frequent Items and Quantiles for finding outliers as well.
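As a quick illustration of the quantiles approach, here is a minimal, untested Java sketch that derives an outlier cutoff from an approximate 99.9th percentile in one pass (the synthetic data and the choice of cutoff are assumptions for illustration, not a recommendation):
```java
import java.util.Random;
import org.apache.datasketches.quantiles.DoublesSketch;
import org.apache.datasketches.quantiles.UpdateDoublesSketch;

public class QuantileOutlierCutoff {
  public static void main(String[] args) {
    // K controls the size/accuracy trade-off of the quantiles sketch.
    UpdateDoublesSketch sketch = DoublesSketch.builder().setK(128).build();

    // Feed the stream: synthetic latencies with a few large spikes.
    Random rnd = new Random(42);
    for (int i = 0; i < 100_000; i++) {
      double v = 10.0 + rnd.nextGaussian();
      if (i % 10_000 == 0) { v += 500.0; }  // occasional spike
      sketch.update(v);
    }

    // Anything beyond the approximate 99.9th percentile gets flagged as an outlier.
    double cutoff = sketch.getQuantile(0.999);
    System.out.println("outlier cutoff ~= " + cutoff);
  }
}
```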
We would be glad to help you with your blog and or meetups with materials, tutorials. Let us know how we can help.
👍 1
k
And if you do anything like that, keep us posted - we'd be happy to cross publish / promote 🙂
l
We are actually preparing a press release with the ASF about our recent graduation. It would be great if you folks could give us a couple of sentences on how useful DataSketches has been for Pinot!
Something with the format: “QUOTE,” said NAME, TITLE at COMPANY. “…MORE…”
@Ken Krugler we have our Research forum on Tuesdays, and I will ask our scientists your question about LLR next week.
k
@Lee Rhodes - thanks, looking forward to what they have to say.
l
@Karin Wolok Hi Karin, did you see my quote request?
k
Hi @Lee Rhodes! I did now! haha. Threads on Slack don't notify you, so it's difficult to keep track. The community is still kind of young, and I can't really pinpoint exactly who would be a good person to get a quote from because I don't know what everyone has in their tool kit. I would suggest posting in this channel to identify them. I also asked a couple of colleagues to see if they know anything about it. Waiting on an answer.
k
@Karin Wolok - Seems like @Mayank knows how Pinot uses DataSketches…
k
Hi @Ken Krugler I did see that! 😃 I think @Lee Rhodes is looking to identify more people, if I am not mistaken? I am still such a newb myself and still getting to know people, so I'm slightly unhelpful here. I do not know anyone off the top of my head (wouldn't have known @Mayank if he hadn't spoken up). I would imagine the general Slack channel might be the best place to find those folks. I did ask some people who know the community well, though, to see if they have any ideas.
Always a fun opportunity to get quoted in media talking about cool tech. haha
l
A quote from @Mayank would be great, or even from Subbu. He and I were together at Yahoo!
m
I'll DM you @Lee Rhodes
l
@Ken Krugler I brought up your question during our science meeting today and I have some information for you.

Log-likelihood functions make fundamental assumptions about the distribution of your data: for example, Gaussian, lognormal, power-law, Poisson, etc. If your assumption about how your data is distributed is relatively close to how it is actually distributed, then LLR functions can be used for identifying outliers. Computing LLR can be feasible in a streaming context, but it depends on the exact LLR function.

You should be aware that the very definition of what constitutes an “outlier” also makes an assumption about the underlying distribution of your data. If the underlying distribution changes, then what you thought was an “outlier” may no longer be one.

There are sketches described in the theoretical literature that try to estimate the underlying distribution dynamically by computing its moments. These moments can then be used to estimate quantiles, which can also be used to identify “outliers”. However, we have stayed away from implementing these types of sketches because they can fail miserably if your data is noisy or not well behaved. And they fail without warning! We have a demonstration of that on our website. All of our sketches are, by design, “distribution independent”, which means the sketch will perform well no matter what the incoming distribution is.

So, yes, it is possible to implement LLR functions in streaming algorithms (sketches), but be forewarned: it would be impossible to give any guarantees about their performance without concomitant guarantees about the incoming data distribution.
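To make the counting side concrete: the common 2x2 form of LLR (Dunning's G² test) is easy to compute once you have the four cell counts, and that is where sketches can supply approximate distinct counts. Here is a minimal, untested Java illustration; the entropy-based formulation follows Dunning's widely used form, and the sample counts are made up:
```java
public class Llr2x2 {
  // x * ln(x), with the 0 * ln(0) = 0 convention.
  static double xLogX(double x) {
    return x == 0.0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy term over raw counts: N*ln(N) - sum(x*ln(x)).
  static double entropy(double... counts) {
    double sum = 0.0, sumXLogX = 0.0;
    for (double x : counts) { sum += x; sumXLogX += xLogX(x); }
    return xLogX(sum) - sumXLogX;
  }

  // G² statistic for a 2x2 contingency table (k11 = co-occurrence count, etc.).
  static double llr(double k11, double k12, double k21, double k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));  // guard roundoff
  }

  public static void main(String[] args) {
    // The four counts could come from sketch estimates instead of exact counting,
    // in which case the sketch error bounds propagate into the LLR value.
    System.out.println(llr(110, 2442, 111, 29114));
  }
}
```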
Cheers, Lee.
k
Hi @Lee Rhodes thanks for the update. There is the assumption of a binomial distribution for LLR to be meaningful; good point. Also thanks for the “concomitant guarantees” phrase, I plan to drop that into a Keynote presentation sometime soon :)
l
🙂