#general

Siavash Mahmoudian

11/22/2020, 6:12 AM
Hi Everyone! 👋 I’m co-founder of an online community platform startup. We help companies create customizable white-labeled social networks to connect their audience together. Apache Pinot looks amazing and we want to use it mostly for our analytics and user segmentation/filtering. Using Pinot for analytics is a no-brainer. However, I’m not sure whether we should use Elasticsearch or Apache Pinot for user filtering.
To give you more context, in our platform users can take different actions such as “Creating a post”, “Liking a post”, “Commenting on a post”, “Buying an item”, etc., and they have different properties such as “Title”, “Age”, “Last Seen At”, etc. An example of user filtering is to fetch all users who have more than 5 posts and 10 comments, are older than 21, and were seen in the last 10 days. We should be able to sort the results on different columns of the user and paginate the results. Now here are my two questions:
1. If we want to use Pinot for user filtering, we will need to set the data retention period to infinite, since the filters can be applied to any time period, including from the beginning. Does Pinot slow down as the amount of data it stores grows over time? Should we think of running cron jobs, every month for instance, to roll up all the very old records into one, or is there no need for that?
2. We want to filter on the number of actions (Buying an item), action fields (the amount of the item that was bought), and user fields (there can be custom fields defined). This means each record that we insert will have many columns. For instance, for the “Buying an item” example, we need to save all the properties of the buyer, the product, and the price. For other actions, we will need to save other properties. This means the number of columns can reach the hundreds. Is Apache Pinot designed to handle that many columns in the schema?
Thanks in advance for the help!
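To make the filtering example above concrete, here is a minimal sketch of the query shape (filter on user fields, count actions per user, keep users above the thresholds, then sort and paginate). It runs against an in-memory SQLite database purely as a stand-in so the example is runnable; it is not Pinot, and Pinot's SQL dialect may differ. The table `user_actions`, its columns, and the helpers `find_users` and `demo_connection` are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical denormalized table: one row per user action, user fields repeated.
SCHEMA = """
CREATE TABLE user_actions (
    user_id      INTEGER,
    age          INTEGER,
    last_seen_at INTEGER,  -- day number (assumption, keeps the demo simple)
    action_type  TEXT      -- 'post', 'comment', ...
)
"""

# Users older than 21, seen recently, with more than 5 posts and 10 comments.
FILTER_QUERY = """
SELECT user_id,
       MAX(age) AS age,
       SUM(CASE WHEN action_type = 'post' THEN 1 ELSE 0 END) AS posts,
       SUM(CASE WHEN action_type = 'comment' THEN 1 ELSE 0 END) AS comments
FROM user_actions
WHERE age > 21 AND last_seen_at >= :seen_since
GROUP BY user_id
HAVING SUM(CASE WHEN action_type = 'post' THEN 1 ELSE 0 END) > 5
   AND SUM(CASE WHEN action_type = 'comment' THEN 1 ELSE 0 END) > 10
ORDER BY posts DESC, user_id ASC
LIMIT :page_size OFFSET :page_offset
"""


def find_users(conn, today, page=0, page_size=20):
    """Return one page of matching users as (user_id, age, posts, comments)."""
    params = {
        "seen_since": today - 10,  # "seen in the last 10 days"
        "page_size": page_size,
        "page_offset": page * page_size,
    }
    return conn.execute(FILTER_QUERY, params).fetchall()


def demo_connection():
    """Build a tiny in-memory data set with four made-up users."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = []
    rows += [(1, 30, 995, "post")] * 6 + [(1, 30, 995, "comment")] * 11   # matches
    rows += [(2, 19, 999, "post")] * 8 + [(2, 19, 999, "comment")] * 12   # too young
    rows += [(3, 40, 998, "post")] * 3 + [(3, 40, 998, "comment")] * 15   # too few posts
    rows += [(4, 25, 900, "post")] * 9 + [(4, 25, 900, "comment")] * 20   # not seen recently
    conn.executemany("INSERT INTO user_actions VALUES (?, ?, ?, ?)", rows)
    return conn
```

Calling `find_users(demo_connection(), today=1000)` returns only the first made-up user; the other three each fail one of the conditions.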

Kishore G

11/22/2020, 6:56 PM
Thanks @Siavash Mahmoudian for the interest in Pinot.
If you only need to keep the aggregates, it's more economical to do periodic aggregations. There is a framework in Pinot (Minion) that helps with this.
Hundreds of columns will not be a problem. The metadata will be bigger for a segment, but that's about it.
We do store the list of columns in memory (vs. data, which is mmapped), so if you have a lot of segments and a lot of columns, the memory required might need to increase. You can think of a 100 KB memory requirement per segment.
So it won't be a lot.
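As a back-of-the-envelope check of the figure above, the sketch below multiplies a segment count by the quoted ~100 KB per segment. Reading "100kb" as 100 KiB and the segment count used in the note are assumptions; the real overhead depends on the number of columns and the deployment.

```python
# Rough per-segment metadata overhead, taken from the ~100 KB figure quoted above.
# Treating it as 100 KiB is an assumption made for this estimate.
METADATA_BYTES_PER_SEGMENT = 100 * 1024


def estimated_metadata_bytes(segment_count, bytes_per_segment=METADATA_BYTES_PER_SEGMENT):
    """Estimate in-memory metadata size for a given number of segments."""
    return segment_count * bytes_per_segment


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)
```

For a hypothetical 10,000 segments, `to_mib(estimated_metadata_bytes(10_000))` comes out just under 1 GiB, which is consistent with the "it won't be a lot" conclusion for a typical server.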

Siavash Mahmoudian

11/22/2020, 7:04 PM
@Kishore G Thanks for making such a great product. The best case would be to have the ability to create queries for any time period. But if that would slow things down, we’d be able to aggregate old content and limit users to filtering up to “1 year ago” OR “All time”. Just wanted to make sure having hundreds of columns will not slow down the queries especially when we do group by. 100kb for each segment feels very reasonable. So it seems overall there won’t be any issues. Thanks!

Kishore G

11/22/2020, 7:06 PM
If you need to keep individual records but also need speed, you can use star-tree indexing.
👀 1
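For readers who have not seen one, a star-tree index is declared in the table config. The sketch below builds such a fragment as a Python dict and prints it as JSON. The field names follow the Pinot star-tree documentation as I understand it and should be checked against the version you run; the dimension and metric column names (`community_id`, `action_type`, `amount`) and the helper `star_tree_config` are hypothetical.

```python
import json


def star_tree_config(dimensions, function_column_pairs, max_leaf_records=10000):
    """Build one star-tree index definition (a sketch, not a complete table config)."""
    return {
        # Dimensions the tree splits on, in order.
        "dimensionsSplitOrder": list(dimensions),
        "skipStarNodeCreationForDimensions": [],
        # Pre-aggregated metrics, written as FUNCTION__column.
        "functionColumnPairs": list(function_column_pairs),
        "maxLeafRecords": max_leaf_records,
    }


# Hypothetical columns for an "actions" table.
table_index_config = {
    "starTreeIndexConfigs": [
        star_tree_config(
            dimensions=["community_id", "action_type"],
            function_column_pairs=["COUNT__*", "SUM__amount"],
        )
    ]
}

print(json.dumps(table_index_config, indent=2))
```

The individual records stay in the segment; the index only adds pre-aggregated values for the listed dimension and function combinations.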

Siavash Mahmoudian

11/22/2020, 7:10 PM
One more question: let’s say in a social feed people can create posts, comment on posts, and react to posts, similar to LinkedIn. Now, we want to show on the user profile how many of each activity they’ve done. In the traditional approach we would store counts for every single activity in the user record, for instance postsCount, likesCount, commentsCount, etc. Given that aggregations are super fast on Pinot, does it make sense not to store these on the user’s record anymore? Or is it still better to store them on the user’s record as a cache and update it whenever we hit Apache Pinot?

Kishore G

11/22/2020, 9:19 PM
With periodic aggregation, it’s ok to serve this directly from Pinot.
🙏 1
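A minimal sketch of what "serving the counts directly" could look like: instead of reading stored postsCount/likesCount/commentsCount fields, the counts are computed per user at read time with a GROUP BY. SQLite is used here only as a runnable stand-in, not as Pinot; the table `user_actions` and the helpers `activity_counts` and `make_demo` are hypothetical.

```python
import sqlite3

# Count each activity type for a single user at read time.
COUNTS_QUERY = """
SELECT action_type, COUNT(*) AS n
FROM user_actions
WHERE user_id = ?
GROUP BY action_type
"""


def activity_counts(conn, user_id):
    """Return {action_type: count} for one user."""
    return dict(conn.execute(COUNTS_QUERY, (user_id,)).fetchall())


def make_demo():
    """Create a tiny in-memory table with made-up actions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_actions (user_id INTEGER, action_type TEXT)")
    conn.executemany(
        "INSERT INTO user_actions VALUES (?, ?)",
        [(7, "post"), (7, "post"), (7, "like"), (7, "like"), (7, "like"),
         (7, "comment"), (8, "post")],
    )
    return conn
```

With the demo data, `activity_counts(make_demo(), 7)` gives two posts, three likes, and one comment, which is the same information the cached count fields would have held.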

Farnood Massoudi

11/23/2020, 9:35 AM
Thanks Kishore for your replies. I’m not sure if this is the right place to ask these questions, but I wonder if you know:
1- If we use SUM(CASE WHEN ...), as far as I can tell, star-tree cannot help us. But what I don’t understand is what happens if we filter the query by, for example, a DateTime. Can star-tree handle that?
2- For example, if we want to filter our results based on another table, we would get the IDs from that table and use external_id IN (IDs). If the number of IDs gets huge, would that be a problem? By huge I mean, for example, 1 million or more. And if it is a problem, can Presto handle it with a join operation?

Kishore G

11/23/2020, 3:38 PM
You are right, the CASE statement will not work with star-tree. For the large IN clause, Presto will work. There are a few tricks in Pinot you can use. We added a few special UDFs to handle this. @Jackie ^^
👍 1

Farnood Massoudi

11/23/2020, 5:59 PM
Where can I see these special UDFs? I cannot find them in the documentation. Also, is there a benchmark showing the performance and latency differences between Pinot and Pinot+Presto?

Jackie

11/23/2020, 6:35 PM
@Farnood Massoudi Here is the design of the special UDF that you can try: https://docs.google.com/document/d/1s6DZ9eTPqH7vaKQlPjKiWb_OBC3hkkEGICIzcd5gozc/edit?usp=sharing
It's still in beta, so only the design doc is linked in the documentation. Feel free to try it out.
👍 2
🙏 2