# general
l
Hello everyone 👋 Looking for some guidance on something I’d like to do, and whether it’s feasible with the features available in Pinot. I have a table which sees a lot of traffic, and it is also one of the most popular tables for time-based aggregations. However, all the aggregation queries want a “snapshot” of a customer on certain days, not only what happened on that day. For example, if I wanted to summarise the state of a customer every week, I would want a snapshot of the most recent state of that organisation every Monday, i.e. reflecting all changes up to that point. Will I need to create a separate table to snapshot state, or does someone have an idea for how I could accomplish this?
k
yes, you’d be better off creating another table with type=REFRESH instead of the default APPEND mode
you will have to partition the data on customer id and ensure that the segment naming is consistent every day
every day, you recompute the data and push it to the same table
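A minimal sketch of what such a REFRESH table config could look like, assuming the classic segmentsConfig / segmentPartitionConfig layout; the table name, column names, and partition count (customerSnapshots, customerId, snapshotMillis, 8 partitions) are placeholders, not anything confirmed in this thread:

```python
import json

# Hypothetical REFRESH-mode offline table, partitioned on the customer column.
snapshot_table_config = {
    "tableName": "customerSnapshots",
    "tableType": "OFFLINE",
    "segmentsConfig": {
        "segmentPushType": "REFRESH",        # replace segments in place instead of appending
        "timeColumnName": "snapshotMillis",
        "replication": "2",
    },
    "tableIndexConfig": {
        "sortedColumn": ["customerId"],      # sorted index on the customer column
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                # partition on customer id; keep numPartitions fixed across pushes
                "customerId": {"functionName": "Murmur", "numPartitions": 8}
            }
        },
    },
    "tenants": {},
    "metadata": {},
}

print(json.dumps(snapshot_table_config, indent=2))
```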
l
That makes sense… and then for “appending” new weekly entries, I simply need to make sure I don’t discard the old segments?
I’m planning on utilising flink for all my backfill requirements
Or does a refresh table only refresh segments if the new segment has the same name? Not entirely sure how this works in practice
k
yes, the new segment should have the same name
l
Do I just overwrite them in S3 then?
k
table_<partitionid> is a good convention to use
yes
l
Gotcha thank you, this is helpful
k
but you will have to call the segment push API again
l
👍 How do I recover from failure in that case?
k
pinot will check that the CRC is different and reload it
l
if the api call fails for some reason
k
it’s idempotent, you can retry
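A rough sketch of that push-and-retry loop, assuming the controller’s multipart segment upload endpoint; the exact path, port, and auth can vary by Pinot version, and the controller address, table name, and file path below are placeholders:

```python
import time
import requests

CONTROLLER = "http://pinot-controller:9000"   # placeholder controller address
TABLE = "customerSnapshots"                   # hypothetical table name

def push_segment(tar_path: str, retries: int = 3) -> None:
    """Push (or re-push) one segment tarball; safe to retry because Pinot keys
    segments by name and only reloads when the CRC changes."""
    for attempt in range(1, retries + 1):
        try:
            with open(tar_path, "rb") as f:
                resp = requests.post(
                    f"{CONTROLLER}/v2/segments",   # controller segment-upload endpoint (path may differ by version)
                    params={"tableName": TABLE},
                    files={"segment": f},
                    timeout=300,
                )
            resp.raise_for_status()
            return
        except requests.RequestException:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)              # back off, then retry the idempotent upload

# Segment names stay stable across days (e.g. customerSnapshots_0 ... customerSnapshots_7),
# so each daily push simply replaces the previous data for that partition.
push_segment("/tmp/customerSnapshots_0.tar.gz")
```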
l
Yes, but not if I overwrite them in S3 myself…
So I take it I upload the segment to a staging bucket first, and then call the upload API?
k
yes
l
👍 And Pinot will move the segment to the S3 bucket where it should live?
k
yes
l
Ah awesome, it makes sense now
thanks
k
it’s very similar to the regular upload flow; the trick is really in naming the segment
l
that should be fine
table_partition_date would work
k
and ensuring you have the same number of partitions
date should not be there
l
And sorting for the sorted index
hmm
But if I want to keep a rolling historical view by week
I’ll need to add new segments
k
ah ok
then you are right
l
Is that an append table then? But just effectively overwriting previous days?
k
correct
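A small sketch of a naming scheme that behaves this way, using a hypothetical segment_name helper; the table name and week format are illustrative:

```python
from datetime import date

def segment_name(table: str, day: date, partition_id: int) -> str:
    """One segment name per (ISO week, partition): within a week the daily
    recompute re-pushes the same names (overwriting them); a new week starts
    a fresh set of names, so history accumulates like an APPEND table."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{table}_{iso_year}w{iso_week:02d}_{partition_id}"

# Two days in the same week map to the same segment name, so the later push overwrites.
assert segment_name("customerSnapshots", date(2020, 3, 2), 0) == "customerSnapshots_2020w10_0"
assert segment_name("customerSnapshots", date(2020, 3, 4), 0) == "customerSnapshots_2020w10_0"
```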
l
That makes sense
There really isn’t much of a functional difference between an APPEND table and a REFRESH table then, is there?
k
nope
the only other difference is bookkeeping
l
👍 thanks for answering all these questions
Bookkeeping as in?
k
we don’t do any retention
l
Ah right
k
for refresh
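A hedged sketch of what the APPEND-mode counterpart could look like, with the retention fields that make up that bookkeeping; the retention window, push frequency, and replication values are illustrative, not recommendations from this thread:

```python
# Hypothetical APPEND-mode config for the weekly-snapshot variant. The retention
# fields are the bookkeeping that REFRESH tables skip: Pinot's retention manager
# drops segments older than the configured window.
weekly_snapshot_config = {
    "tableName": "customerSnapshotsWeekly",
    "tableType": "OFFLINE",
    "segmentsConfig": {
        "segmentPushType": "APPEND",
        "segmentPushFrequency": "DAILY",   # expected push cadence
        "timeColumnName": "snapshotMillis",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "365",       # keep roughly a year of weekly snapshots
        "replication": "2",
    },
}
```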
l
that makes sense
What’s the significance of the refresh interval then?
for minion tasks?
k
more for alerting
we used it to alert if no refresh happened in that interval
in your case, I would go with APPEND
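A possible staleness check in the spirit of that alerting, querying the broker for the newest snapshot time; the broker address, table, column name, and threshold are placeholders:

```python
import time
import requests

BROKER = "http://pinot-broker:8099"            # placeholder broker address
EXPECTED_REFRESH_MILLIS = 24 * 3600 * 1000     # alert if no refresh within a day

def snapshots_are_stale() -> bool:
    """Return True if the newest snapshot is older than the expected refresh interval."""
    resp = requests.post(
        f"{BROKER}/query/sql",
        json={"sql": "SELECT MAX(snapshotMillis) FROM customerSnapshots"},
        timeout=30,
    )
    resp.raise_for_status()
    latest_millis = resp.json()["resultTable"]["rows"][0][0]
    return (time.time() * 1000) - latest_millis > EXPECTED_REFRESH_MILLIS

if snapshots_are_stale():
    print("ALERT: customerSnapshots has not been refreshed within the expected interval")
```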