# general
Hi all, I work at an analytics start-up in Sydney, Australia that has just landed Series A investment, so we're now re-architecting one of our products for scale and to support more features - going from an MVP monolith on a Postgres DB to a more distributed system. I'm looking to understand how Pinot might fit into the new stack, how demanding a Pinot deployment is from an operations perspective, and how we might have to scale our engineering team to support such a deployment in production. If the context helps, it's a multi-tenant application that orchestrates some ETL+ML data pipelines with Spark/Databricks and the Gurobi optimiser. Essentially, the outputs are "results" datasets (Parquet files on Azure Data Lake) comprising 50M~100M rows and 30~40 columns. We're aiming to support at least 500 users spread over several customers/tenants. We expect concurrent users to be generating/experimenting with new datasets regularly throughout the day (hourly). One of the web applications will be a front-end with a UI "data-grid" where users will want to perform exploratory/interactive analysis on results, so aggregations/group-by/filtering/search/counting etc., at reasonably low latency. On paper, Pinot looks like a great fit, but is it overkill for us? How many engineers would it take to support a deployment for our volume of data/ingestion? Note that ZooKeeper is not yet part of our stack. Sorry for the wall of text. Any advice or experience from others here would be greatly appreciated. Cheers.
Congratulations! 50 to 100 million rows is not large. While Pinot can definitely handle it, it would be good to understand the issues you are facing with your existing setup.
Regarding operating Pinot, it definitely has a learning curve. Integration with K8s does make it easier. To give you an idea, 2 to 3 SREs at LinkedIn manage 1000+ nodes.
ZooKeeper is not hard to manage/operate as long as you follow the guidelines.
Hey, thanks for the reply. Currently, for the MVP, we have a single write master and a couple of read replicas (Postgres). I believe that for our dataset volumes we are getting an average OLAP-style query latency of 1.7 seconds, which is far too slow. I think there are some "legacy design" issues with the current data model contributing to this. There's also a lot of contention on bulk import of datasets. I'll gather some metrics and more details this week.
Since we're exploring options - at first glance a columnar OLAP store seemed ideal, as we could treat our Parquet files on data lake storage as our source of truth (and we'll never need to modify these datasets). We'd also like to have that low-latency potential, to open up other workflow opportunities in future and to handle any surges in data volume (which are possible in the big-retail sector we serve).
But what do you think? Is Pinot something you'd look at first? What kind of latencies could we expect from Pinot for 50M->100M->200M datasets under load, and roughly how many nodes would we need for that?
You can get millisecond latency with Pinot for most queries. Go ahead and load the data. Let’s get the perf numbers and we can advise on the right indexes to use. Note that indexes can be added dynamically without re-bootstrapping the data.
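To make the point about adding indexes later more concrete, here is a minimal sketch of what an offline table config with index columns might look like, built as a plain Python dict. The table name `results` and the columns `tenant_id`, `store_id`, `price` and `product_category` are hypothetical, as are the helper functions. The field names (`tableIndexConfig`, `invertedIndexColumns`, `rangeIndexColumns`) follow Pinot's table config layout as I understand it, so check them against the docs for your Pinot version before relying on this.

```python
import json


def build_offline_table_config(table_name, inverted_cols, range_cols):
    """Return a minimal Pinot OFFLINE table config as a dict (illustrative only)."""
    return {
        "tableName": table_name,
        "tableType": "OFFLINE",
        "segmentsConfig": {"replication": "1"},
        "tenants": {"broker": "DefaultTenant", "server": "DefaultTenant"},
        "tableIndexConfig": {
            "loadMode": "MMAP",
            # Inverted indexes help equality filters (e.g. tenant_id = 'x').
            "invertedIndexColumns": list(inverted_cols),
            # Range indexes help range filters (e.g. price > 10).
            "rangeIndexColumns": list(range_cols),
        },
        "metadata": {},
    }


def add_inverted_index(config, column):
    """Return a copy of the config with `column` added to the inverted index list."""
    updated = json.loads(json.dumps(config))  # cheap deep copy of a JSON-like dict
    columns = updated["tableIndexConfig"]["invertedIndexColumns"]
    if column not in columns:
        columns.append(column)
    return updated


# Hypothetical column names, chosen only for illustration.
config = build_offline_table_config("results", ["tenant_id", "store_id"], ["price"])

# Later, once query patterns are known, add another index to the same table.
config_v2 = add_inverted_index(config, "product_category")
print(json.dumps(config_v2, indent=2))
```

The idea is that the data is loaded once and only the config changes: you would submit the updated config to the controller and, as far as I know, reload the table's segments so the new index gets built on the existing data.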