hello i m trying to figure out how pinot can support the que Apache Pinot #general

hello, i’m trying to figure out how pinot can supp...

xtrntr

07/31/2021, 12:23 AM

hello, i’m trying to figure out how pinot can support the queries i’ll like to make. my schema has the following fields:

Copy code

userId | time (15 minutes) | cell
a        0                      1
a        1                      2
a        2                      -1
a        3                      -1
a        4                      -1
a        5                      3
a        6                      3
..
z        0                      1

i’ll like to interpolate the missing values (-1), then query the result of that transformation. my understanding is that pinot doesn’t support subqueries but does so in conjunction with presto, is this the right channel to ask?

Copy code

# example of values after interpolation
userId | time (15 minutes) | cell
a        0                      1
a        1                      2
a        2                      2
a        3                      2 or 3
a        4                      3
a        5                      3
a        6                      3

# example of values after forwardFill
userId | time (15 minutes) | cell
a        0                      1
a        1                      2
a        2                      2
a        3                      2
a        4                      2
a        5                      3
a        6                      3

Mayank

07/31/2021, 12:26 AM

Yes you can use Pinot+Presto to get a full sql querying support. However, was a bit unclear on your requirement, what's the actual sql query you'd like to make?

xtrntr

07/31/2021, 12:27 AM

ah im not quite sure how to express this transformation in sql

xtrntr

07/31/2021, 12:28 AM

but the input/output should give an idea of what im trying to achieve with forwardFill/interpolation?

Ken Krugler

07/31/2021, 12:30 AM

I’m guessing you’d need to do this interpolation via an ETL process (prior to generating segments), as otherwise your performance is going to be painfully slow if you’re basically doing series transformations for each query.

xtrntr

07/31/2021, 12:32 AM

that is a good idea and something i will consider, thanks @User!

xtrntr

07/31/2021, 12:35 AM

for more context, one of the reasons i am looking into this possibility is because i’m building an analytical dashboard

xtrntr

07/31/2021, 12:36 AM

and one of the options we extend to our users is to specify the maximum gap size for interpolation (as well as different techniques to apply interpolation)

Kishore G

07/31/2021, 12:53 AM

Will this be done for a single user or across all users?

xtrntr

07/31/2021, 12:54 AM

many users, they will be allowed to specify their custom interpolation values. but its possible that offering this may be customization taken too far

xtrntr

07/31/2021, 12:54 AM

but ken’s answer has given me an idea, essentially i can get this transformation with window functions but it’ll probably be very slow

Kishore G

07/31/2021, 12:54 AM

I meant in a single query

xtrntr

07/31/2021, 12:54 AM

ah sorry. across all users

Kishore G

07/31/2021, 12:55 AM

What do you plan to do with so many data points?

xtrntr

07/31/2021, 12:56 AM

you mean after the interpolation is done?

Kishore G

07/31/2021, 12:56 AM

For e.g after interpolated/forward filling, how will the client use this data

Kishore G

07/31/2021, 12:56 AM

Yes

xtrntr

07/31/2021, 12:56 AM

a simple example would be filtering for users who have been in cell x for y timebins

Kishore G

07/31/2021, 12:58 AM

Okay. So raw data -> window function transform + filter ?

xtrntr

07/31/2021, 12:59 AM

yup, that’s one use case

xtrntr

07/31/2021, 12:59 AM

there might be one final aggregate step

Kishore G

07/31/2021, 12:59 AM

Was about to ask that next 😉

Kishore G

07/31/2021, 1:00 AM

Ok..one more question, how many rows per user is and is it possible to partition it?

Kishore G

07/31/2021, 1:01 AM

Also batch vs real-time

xtrntr

07/31/2021, 1:01 AM

so 1 user will definitively be 96 timebins (4*24) per day

xtrntr

07/31/2021, 1:02 AM

well, we actually have this dashboard up already

xtrntr

07/31/2021, 1:02 AM

but we use spark to run these analytical queries and we’re trying to see if we can support these queries for real time

xtrntr

07/31/2021, 1:07 AM

just curious, what is the direction of your questions leading down to?

Kishore G

07/31/2021, 1:11 AM

To understand how much work should be done pre vs post ingestion

👍 1

Open in Slack

Previous Next