Hi guys Hey I know that Pinot supports ScalarFunction but is Apache Pinot #troubleshooting

Hi guys! Hey, I know that Pinot supports ScalarFun...

Diogo Baeder

02/27/2022, 3:56 PM

Hi guys! Hey, I know that Pinot supports ScalarFunction, but is it possible to add custom aggregation functions somehow? How can I aggregate data in a way that I can have logic where rows are inter-dependent?

🍷 1

Diogo Baeder

02/27/2022, 3:59 PM

The reason why I'm asking this is because I have a use case where I need to aggregate users requests into buckets that have a minimum distance of 30 minutes between each other (it's how we define "user sessions" in our project), but I'm currently doing this in Python and it involves a lot of I/O from Pinot to my application, so I was wondering whether it would be possible to push this logic and have the aggregation done in the Pinot cluster itself

Mayank

02/27/2022, 4:13 PM

See if you can use groovy functions for that

Mayank

02/27/2022, 4:14 PM

What is the response data size that you are taking out from Pinot? 40s seems unusually large for a 500ms query

Diogo Baeder

02/27/2022, 4:42 PM

I was able to bring it down to ~12s, where ~10s from that is just for transferring from Pinot into my application (AWS EU datacenter into on-premises US datacenter - so it's an understandable bottleneck). I haven't checked the response data size yet, but I'll check, I bet it's quite a few MB. How is Groovy able to cope with aggregating rows though? I understand how it can process one or more columns in a row, but I don't understand how it can work with multiple rows in the same call.

Mayank

02/27/2022, 4:47 PM

Yeah may be not groovy. Will grouping by 30min buckets help? If so, there’s the datetime conversion udf that can be used. If not, Can you elaborate the requirement a bit?

Diogo Baeder

02/27/2022, 4:54 PM

The problem is that it would be buckets fixed in time, so, for example, a request done at 142959 would be on a different session than another one made by the same user at 143001. The idea is that, as soon as a new request is iterated over but was has more than 30 mins of difference from the last request, it would be aggregated as a different session. Let me elaborate a bit better. Imagine these requests by the same user, done in the same day, at times: 1. 142959 2. 143001 3. 144500 4. 152000

Diogo Baeder

02/27/2022, 4:55 PM

What we want is to group those 3 first requests into the same session, containing 3 visits, while the 4th request would be a second session with only 1 visit.

Diogo Baeder

02/27/2022, 4:57 PM

Another scenario: imagine the same user making 100 requests, one every minute. We want to group all those 100 requests as a single session, containing 100 visits, no matter how much time in total the user has been in that session.

Diogo Baeder

02/27/2022, 4:58 PM

The breaking point, for us, is the time between consecutive requests for the user: if it's more than 30 minutes, we break up into a different session.

Mayank

02/27/2022, 5:00 PM

Yeah seems like window function, which Pinot doesn’t have right now. In the absence of that, the only options I can think of are upstream processing to mark sessions, or client side processing.

Mayank

02/27/2022, 5:01 PM

For client side why is the response size huge? What’s the query and can you process batches in parallel?

Diogo Baeder

02/27/2022, 5:19 PM

The response size is so huge because of the amount of requests we get. But, honestly, I need to pin down how much data is being transferred. And we have a plan to migrate our app into AWS@EU too, so the latency would be dramatically reduced, which would cut down a lot of that time. Processing in parallel would be a bit too complicated to implement because we would have to rely on something else than Python (because of the GIL - multithreading is not efficient at all when it comes to parallel computations in Python), but that's not the biggest issue we have, the biggest one is I/O.

Mayank

02/27/2022, 5:57 PM

By parallel I meant that if there are batches of queries that are non overlapping you can simply fire multiple queries in parallel and just concat the results. That would not require complex MT code in python. However, if the plan is to migrate the app to same aws region, that would be ideal.

Diogo Baeder

02/27/2022, 6:01 PM

Got it. Yeah, we hope that can happen soon. For now, though, it's not a huge issue, just something I want to optimize for greater UX. I mean, we have now a use case that responds in ~12s whereas in our old system that used to take like at least 10 minutes! So not much to complain there... 😄

Diogo Baeder

02/27/2022, 8:23 PM

Man, your idea about using buckets is actually not bad! Actually, I already do that for putting them on a minute-resolution bucket aggregation, so I was thinking, putting them in 10-minute buckets would be fine too, because there's just a slight chance I end up getting visits on a same session when they should actually be on different sessions.

Diogo Baeder

02/27/2022, 8:35 PM

Actually, there's a slight difference that happens in the numbers we get, but it might not be a problem.

👍 1

Mayank

02/27/2022, 8:42 PM

Sounds good

Diogo Baeder

02/27/2022, 8:42 PM

When I do that and use

ToEpochSecondsRounded("timestamp", 600

instead of

ToEpochSecondsRounded("timestamp", 60)

, I'm able to bring the time to load the data into my app from ~10s to ~6s, and post-processing from ~4s to ~0.8s, which is quite a bit.

Diogo Baeder

02/27/2022, 8:44 PM

Ah, by the way, we still do use Python's

datetime

instances in our app, but now that I'm getting seconds back from Pinot instead of date-time strings, I'm converting them directly from epoch seconds, which is way faster (as would be expected in most languages, since usually processing numbers is faster than processing strings, in general).

Open in Slack

Previous Next