# general
a
In the article “Pinot: Realtime OLAP for 530 Million Users” it says:

> At LinkedIn, business events are published in Kafka streams and are ETL'ed onto HDFS. Pinot supports near-realtime data ingestion by reading events directly from Kafka [19] as well as data pushes from offline systems like Hadoop. As such, Pinot follows the lambda architecture [23], transparently merging streaming data from Kafka and offline data from Hadoop. As data on Hadoop is a global view of a single hour or day of data as opposed to a direct stream of events, it allows for the generation of more optimal segments and aggregation of records across the time window.
Is there a general rule of thumb for when I should keep raw events in Pinot vs. aggregated data?
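(For context, the “transparently merging” in the quote refers to what Pinot calls a hybrid table: a REALTIME table ingesting from Kafka paired with an OFFLINE table fed by Hadoop segment pushes, both sharing the same name. A minimal sketch of the realtime half; the table, schema, topic, and broker names here are placeholders, not from the paper:

```json
{
  "tableName": "businessEvents",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "businessEvents",
    "timeColumnName": "eventTime",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "business-events",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
    }
  },
  "tenants": {},
  "metadata": {}
}
```

Queries against the table name transparently span both halves: the broker routes the recent time range to the realtime segments and older ranges to the offline segments built on Hadoop.)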
k
My rule of thumb would be “Use raw events until you are convinced that won’t work.” Try to keep everything as raw events, with appropriate indexes (especially star-tree indexes). If and only if you can’t get that to work (latency is too high, storage requirements are too big, etc.), then look at doing pre-aggregations. I’m giving a Pinot talk on June 22nd, and one of the items is how we worked around what looked like a pre-aggregation requirement.
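(For reference, a star-tree index is declared under `tableIndexConfig` in the table config. A minimal sketch, with hypothetical dimension and metric column names:

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "browser", "locale"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["COUNT__*", "SUM__clicks"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

This keeps the raw events in the segment while Pinot maintains pre-aggregated star-tree nodes for the listed function/column pairs, so a query like `SELECT country, SUM(clicks) ... GROUP BY country` can be served from the aggregates without scanning raw rows. Lower `maxLeafRecords` trades storage for lower query latency.)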