Hi all, I work at an analytics start-up in Sydney, Australia, that has just landed Series A investment. We're now re-architecting one of our products for scale and to support more features, moving from an MVP monolith backed by PostgreSQL to a more distributed system.
I'm looking to understand how Pinot might fit into the new stack, and specifically, from an operations perspective, how demanding a Pinot deployment is to run in production and how we'd need to scale our engineering teams to support it.
For context: it's a multi-tenant application that orchestrates ETL+ML data pipelines with Spark/Databricks and the Gurobi optimiser. The outputs are "results" datasets (Parquet files on Azure Data Lake) of roughly 50M-100M rows and 30-40 columns. We're aiming to support at least 500 users spread over several customers/tenants, and we expect concurrent users to be generating and experimenting with new datasets regularly throughout the day (roughly hourly). One of the web applications will be a front-end with a UI "data-grid" where users will perform exploratory/interactive analysis on results - aggregations, group-bys, filtering, search, counts, etc. - at reasonably low latency.
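To make the workload concrete, here's a minimal sketch of the kind of SQL a single grid interaction would map to. The table and column names (`results`, `tenant_id`, `region`, `scenario_id`, `cost`) are hypothetical placeholders, not our real schema - it's just to show the aggregation/group-by/filter shape we'd be sending to Pinot:

```python
def grid_query(tenant_id: str, filters: dict, group_by: list, metric: str) -> str:
    """Build the aggregation/group-by/filter SQL one data-grid interaction maps to.

    All identifiers here are illustrative placeholders, not a real schema.
    """
    # Every query is scoped to a tenant, plus whatever filters the user has applied.
    conditions = [f"tenant_id = '{tenant_id}'"]
    conditions += [f"{col} = '{val}'" for col, val in filters.items()]
    where = " AND ".join(conditions)
    dims = ", ".join(group_by)
    return (
        f"SELECT {dims}, SUM({metric}) AS total, COUNT(*) AS n "
        f"FROM results WHERE {where} "
        f"GROUP BY {dims} ORDER BY total DESC LIMIT 100"
    )

# e.g. a user filtering to one region and grouping by scenario:
print(grid_query("acme", {"region": "APAC"}, ["scenario_id"], "cost"))
```

Each such query would need to come back in interactive time (sub-second-ish) over a 50M-100M row dataset, with many users doing this concurrently.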
On paper, Pinot looks like a great fit, but is it overkill for us? How many engineers would it take to support a deployment at our data/ingestion volume? Note that ZooKeeper isn't part of our stack yet. Sorry for the wall of text - any advice or experience from others here would be greatly appreciated. Cheers.