# general
e
what's a good channel to discuss more system / platform design-oriented questions? here? https://www.startree.ai/blogs/real-time-analytics-at-scale-solving-the-trade-off-problem read this last night, i have questions. i am intrigued but skeptical that things will pan out well with the approach proposed here, where we eschew data modelling / preparation and just rely on indexing in Pinot
also, do people actually do this in practice? how do you get around not having join support? Presto / Trino?
or i could be reading this wrong, in that what the article is saying is that you still need to have all those data models (raw, pre-agg, pre-cube) but just store them all in Pinot.
Or is the right interpretation: ingest raw data, build indexes, and those will serve as your pre-agg / pre-cube layers?
m
Hey @Edwin Law the article gives more of an overall view. It depends on your use case and what you want to optimize for (happy to help in this regard as well).
Pinot has lookup join today (where you want to do dimension lookups). We are working on a multi-stage query execution engine, with which joins will be supported. In the interim, yes, Presto/Trino is the recommendation.
No, you don’t need to have all data models, you pick the one that works for you the best.
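For illustration, here is a minimal sketch of what the lookup-join route can look like from Python, using the pinotdb DB-API client and Pinot's lookUp() transform. The table and column names (orders, merchants, merchantId, etc.) and the broker address are made up for the example, not anything discussed above; the dimension table is assumed to be configured as a Pinot dimension table so it gets replicated to all servers.

```python
# Sketch: query Pinot through the pinotdb DB-API client and use lookUp()
# to enrich fact rows with a dimension-table column.
from pinotdb import connect

# Connect to a Pinot broker (host/port here are placeholders).
conn = connect(host="pinot-broker", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()

# lookUp(dimTable, dimColToGet, dimJoinKey, factJoinKeyValue) joins each
# fact row against a dimension table replicated on every server.
curs.execute("""
    SELECT orderId,
           amount,
           lookUp('merchants', 'merchantName', 'merchantId', merchantId) AS merchantName
    FROM orders
    WHERE amount > 100
    LIMIT 10
""")

for row in curs:
    print(row)
```

For anything beyond this kind of dimension lookup (fact-to-fact joins, multi-way joins), the Presto/Trino route mentioned above would sit in front of Pinot until the multi-stage engine lands.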
e
So over here at Grab, we have a super extensive data lake with tons of data, and most work is done there. but data is, as usual, not that fresh, even though we're trying to bring in things like Hudi / Delta to help improve that
m
What’s the end use case for which you think you need something like Pinot?
e
We're looking at Pinot to be a place for people to work on calculating / retrieving realtime metrics.
m
By people do you mean your internal BI folks and data scientists?
e
we use flink for this today, but flink is of course not a great place to serve metrics
via dashboards, etc.
yes, internal BI folks / DS, and via dashboards, internal operations teams that want to see the state of the world
m
Ok, for internal dashboards, you will typically have heavy write QPS but light read QPS. Also, you’d be ok with sub-second latency.
e
i'm trying to figure out where to position Pinot and how we should think about it.
m
One approach would be to ingest denormalized data into Pinot, and serve dashboards from it.
e
so we should do the denormalization in stream?
m
If you can’t denormalize, then you could use Pinot + Presto/Trino
Yeah, you could do it in stream.
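As a rough illustration of what "denormalize in stream" could look like with Flink (which is already in the picture here), below is a PyFlink SQL sketch that joins a raw orders stream with a merchant dimension stream and writes the flattened rows to a Kafka topic that a Pinot realtime table could ingest from. Topic names, schemas, and the Kafka address are assumptions for the example, not the actual setup.

```python
# Sketch: stream denormalization with PyFlink SQL, producing a flattened
# Kafka topic suitable for Pinot realtime ingestion.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Raw fact stream from Kafka (schema/topic are hypothetical).
t_env.execute_sql("""
    CREATE TABLE orders (
        orderId STRING,
        merchantId STRING,
        amount DOUBLE,
        eventTime TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Dimension stream carrying merchant metadata.
t_env.execute_sql("""
    CREATE TABLE merchants (
        merchantId STRING,
        merchantName STRING,
        city STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'merchants',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Denormalized output topic that the Pinot realtime table would consume.
t_env.execute_sql("""
    CREATE TABLE orders_denormalized (
        orderId STRING,
        amount DOUBLE,
        merchantName STRING,
        city STRING,
        eventTime TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders_denormalized',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# Flatten the fact rows by joining in the dimension attributes.
t_env.execute_sql("""
    INSERT INTO orders_denormalized
    SELECT o.orderId, o.amount, m.merchantName, m.city, o.eventTime
    FROM orders AS o
    JOIN merchants AS m ON o.merchantId = m.merchantId
""")
```

The trade-off is the usual one: denormalizing in stream keeps Pinot queries simple and fast (no join at query time), at the cost of state in the Flink job and wider rows in Pinot.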
e
how much should we invest in data modelling?
or how much do people who use pinot invest in that aspect?
m
It varies. But it is always good to get the modelling right.
How many dashboards are we talking about?
e
like, do we go to the extent of the DWH models where you have fact tables, dim tables, etc., or do people usually just go with denormalization?
we're trying to start a platform here, so initially 1? but we want to support a lot of use cases
m
I see. Let me share my experience there shortly
e
kinda like make this a real-time companion to the datalake?