Hi everyone, I’m looking to leverage Pinot in a simple analytics use-case: allowing distinct counts, funnel analysis and anomaly detection of user click events from our app.
Currently, at our somewhat large company, we are ingesting 100~200K events/second across 300 different (but defined) schemas. The biggest schema should have 40 columns, but the majority have fewer than 20. In this mix, at least 10% of the events are late events and duplicates. (2TB a day)
Currently we have more than 500 users querying this data in an exploratory/interactive fashion through our own front-end. With Pinot we hope to achieve sub-minute latency.
Pinot looks like the perfect fit for this use-case, since there is no need to join events, but my main doubt is how big this infrastructure should be to support this volume. And how hard is it going to be to support a deployment for this volume?
I’m planning on deploying with K8s using S3 as segment store for Pinot. I also don’t need the Offline Server or any batch ingestion job.
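For a rough sense of scale, the numbers above can be cross-checked with some back-of-the-envelope arithmetic. This is only a sketch: it assumes decimal terabytes and that the 2TB/day figure covers all events at a steady rate, neither of which is stated in the thread.

```python
# Back-of-the-envelope sizing from the figures quoted in the message.
# Assumptions (not from the thread): 1 TB = 10**12 bytes, steady event rate.
SECONDS_PER_DAY = 24 * 60 * 60
DAILY_BYTES = 2 * 10**12  # "2TB a day"


def events_per_day(events_per_second: int) -> int:
    """Total events in one day at a constant ingestion rate."""
    return events_per_second * SECONDS_PER_DAY


def avg_event_bytes(events_per_second: int) -> float:
    """Average bytes per event implied by the daily volume."""
    return DAILY_BYTES / events_per_day(events_per_second)


for rate in (100_000, 200_000):
    print(f"{rate} events/s: {events_per_day(rate):,} events/day, "
          f"about {avg_event_bytes(rate):.0f} bytes/event")
```

At these rates the implied average event size is in the low hundreds of bytes, which is consistent with schemas of mostly fewer than 20 columns.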
10/01/2020, 3:21 PM
Funnel analysis is definitely a complex use case, but it looks like you are OK with sub-minute latency. For cost-efficiency and better latency, you will have to use some advanced features like partitioning and ensuring all segments of a partition reside on the same node. We have recently added some UDFs to support the funnel analysis use case, which might also benefit you.
While a batch server is not needed in practice, having one allows you to reorganize the data, which will improve efficiency further.
My suggestion would be to start with a week's data (14TB) and run a benchmark. Note that Pinot has a default timeout of 10 seconds, and you might have to increase this for your use case. Once you have the data in and baseline perf numbers, we can guide you on optimizing the perf.
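As a minimal sketch of what the partitioning part might look like, the snippet below builds the relevant fragment of a Pinot table config. The column name `user_id`, the partition count of 16, and the choice of the `Murmur` function are illustrative assumptions, not values from this thread. Keeping all segments of a partition on the same node is a separate instance-assignment setting that is not shown here.

```python
import json


def partitioned_table_fragment(column: str, num_partitions: int) -> dict:
    """Build the partition-related part of a Pinot table config.

    Only the partitioning and partition-based pruning sections are shown;
    a real table config needs many more fields.
    """
    return {
        "tableIndexConfig": {
            "segmentPartitionConfig": {
                "columnPartitionMap": {
                    # Hypothetical column and partition count.
                    column: {
                        "functionName": "Murmur",
                        "numPartitions": num_partitions,
                    }
                }
            }
        },
        # Lets the broker skip segments whose partition cannot match the filter.
        "routing": {"segmentPrunerTypes": ["partition"]},
    }


fragment = partitioned_table_fragment("user_id", 16)
print(json.dumps(fragment, indent=2))
```

For a realtime table, the upstream stream would typically need to be partitioned on the same column with a matching function for the pruning to be effective.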
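One way to raise the timeout for an individual query is the `timeoutMs` query option. The sketch below only builds the JSON body that would be POSTed to the broker's SQL query endpoint. The table name `click_events` and the 60-second value are hypothetical, and the exact request shape is an assumption that should be checked against the Pinot version in use.

```python
import json


def build_query_payload(sql: str, timeout_ms: int) -> str:
    """Return a JSON request body carrying a per-query timeout option."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return json.dumps({
        "sql": sql,
        # Query options are passed as a "key=value" string.
        "queryOptions": f"timeoutMs={timeout_ms}",
    })


# Hypothetical table name; 60_000 ms matches the sub-minute latency goal.
payload = build_query_payload("SELECT COUNT(*) FROM click_events", 60_000)
print(payload)
```

A cluster-wide default can also be changed in the broker configuration instead of per query, which may be simpler for a benchmark run.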
10/01/2020, 4:22 PM
Thank you for your detailed answer @Kishore G!
I thought funnel analysis could be easier, but I can totally live without it for the time being.
I’ll move things around here and come back once I get some benchmark results.
Do you think using S3 as FS could be a problem for performance?
10/01/2020, 4:30 PM
There are many variations in funnel analysis. Some of them are straightforward group-by style queries, while others might need subqueries or windowing. Without knowing the details of your use case, it’s hard to say which one you need.
S3 as FS will probably not work well given your requirement
Happy to create a channel and discuss further
10/01/2020, 4:48 PM
I’ll get my hands dirty before bothering you further 😉
Thanks for the help so far!