hello, i’m trying to figure out how pinot can supp...
# general
x
hello, i’m trying to figure out how pinot can support the queries i’ll like to make. my schema has the following fields:
Copy code
userId | time (15 minutes) | cell
a        0                      1
a        1                      2
a        2                      -1
a        3                      -1
a        4                      -1
a        5                      3
a        6                      3
..
z        0                      1
i’ll like to interpolate the missing values (-1), then query the result of that transformation. my understanding is that pinot doesn’t support subqueries but does so in conjunction with presto, is this the right channel to ask?
Copy code
# example of values after interpolation
userId | time (15 minutes) | cell
a        0                      1
a        1                      2
a        2                      2
a        3                      2 or 3
a        4                      3
a        5                      3
a        6                      3

# example of values after forwardFill
userId | time (15 minutes) | cell
a        0                      1
a        1                      2
a        2                      2
a        3                      2
a        4                      2
a        5                      3
a        6                      3
m
Yes you can use Pinot+Presto to get a full sql querying support. However, was a bit unclear on your requirement, what's the actual sql query you'd like to make?
x
ah im not quite sure how to express this transformation in sql
but the input/output should give an idea of what im trying to achieve with forwardFill/interpolation?
k
I’m guessing you’d need to do this interpolation via an ETL process (prior to generating segments), as otherwise your performance is going to be painfully slow if you’re basically doing series transformations for each query.
x
that is a good idea and something i will consider, thanks @User!
for more context, one of the reasons i am looking into this possibility is because i’m building an analytical dashboard
and one of the options we extend to our users is to specify the maximum gap size for interpolation (as well as different techniques to apply interpolation)
k
Will this be done for a single user or across all users?
x
many users, they will be allowed to specify their custom interpolation values. but its possible that offering this may be customization taken too far
but ken’s answer has given me an idea, essentially i can get this transformation with window functions but it’ll probably be very slow
k
I meant in a single query
x
ah sorry. across all users
k
What do you plan to do with so many data points?
x
you mean after the interpolation is done?
k
For e.g after interpolated/forward filling, how will the client use this data
Yes
x
a simple example would be filtering for users who have been in cell x for y timebins
k
Okay. So raw data -> window function transform + filter ?
x
yup, that’s one use case
there might be one final aggregate step
k
Was about to ask that next 😉
Ok..one more question, how many rows per user is and is it possible to partition it?
Also batch vs real-time
x
so 1 user will definitively be 96 timebins (4*24) per day
well, we actually have this dashboard up already
but we use spark to run these analytical queries and we’re trying to see if we can support these queries for real time
just curious, what is the direction of your questions leading down to?
k
To understand how much work should be done pre vs post ingestion
👍 1