# general
l
Hello everyone 👋 Looking for some guidance on something I’d like to do, and whether it’s feasible with the features available in Pinot. I have a table which sees a lot of traffic, and it is also one of the most popular tables for time-based aggregations. However, all the aggregation queries want a “snapshot” of a customer on certain days, not only what happened on that day. For example, if I wanted to summarise the state of a customer every week, I would want a snapshot of the most recent state of that organisation every Monday, i.e. reflecting all changes up to that point. Will I need to create a separate table to snapshot state, or does someone have an idea for how I could accomplish this?
k
yes, you’d be better off creating another table with type=REFRESH instead of the default APPEND mode
you will have to partition the data on customer id and ensure that the segment naming is consistent every day
every day, you recompute the data and push it to the same table
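A minimal sketch of what such a REFRESH table config could look like, assuming the classic segmentsConfig / segmentPartitionConfig layout; the table name, column names, and partition count (customerSnapshots, customerId, snapshotMillis, 8 partitions) are placeholders, not anything confirmed in this thread:

```python
import json

# Hypothetical REFRESH-mode offline table, partitioned on the customer column.
snapshot_table_config = {
    "tableName": "customerSnapshots",
    "tableType": "OFFLINE",
    "segmentsConfig": {
        "segmentPushType": "REFRESH",        # replace segments in place instead of appending
        "timeColumnName": "snapshotMillis",
        "replication": "2",
    },
    "tableIndexConfig": {
        "sortedColumn": ["customerId"],      # sorted index on the customer column
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                # partition on customer id; keep numPartitions fixed across pushes
                "customerId": {"functionName": "Murmur", "numPartitions": 8}
            }
        },
    },
    "tenants": {},
    "metadata": {},
}

print(json.dumps(snapshot_table_config, indent=2))
```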
l
That makes sense… and then for “appending” new weekly entries, I simply need to make sure I don’t discard the old segments?
I’m planning on utilising flink for all my backfill requirements
Or does a refresh table only refresh segments if the new segment has the same name? Not entirely sure how this works in practice
k
yes, the new segment should have the same name
l
Do I just overwrite them in S3 then?
k
table_<partitionid> is a good convention to use
yes
l
Gotcha thank you, this is helpful
k
but you will have to call the segment push API again
l
👍 How do I recover from failure in that case?
k
pinot will check that the CRC is different and reload it
l
if the api call fails for some reason
k
it’s idempotent, you can retry
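A rough sketch of that push-and-retry loop, assuming the controller’s multipart segment upload endpoint; the exact path, port, and auth can vary by Pinot version, and the controller address, table name, and file path below are placeholders:

```python
import time
import requests

CONTROLLER = "http://pinot-controller:9000"   # placeholder controller address
TABLE = "customerSnapshots"                   # hypothetical table name

def push_segment(tar_path: str, retries: int = 3) -> None:
    """Push (or re-push) one segment tarball; safe to retry because Pinot keys
    segments by name and only reloads when the CRC changes."""
    for attempt in range(1, retries + 1):
        try:
            with open(tar_path, "rb") as f:
                resp = requests.post(
                    f"{CONTROLLER}/v2/segments",   # controller segment-upload endpoint (path may differ by version)
                    params={"tableName": TABLE},
                    files={"segment": f},
                    timeout=300,
                )
            resp.raise_for_status()
            return
        except requests.RequestException:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)              # back off, then retry the idempotent upload

# Segment names stay stable across days (e.g. customerSnapshots_0 ... customerSnapshots_7),
# so each daily push simply replaces the previous data for that partition.
push_segment("/tmp/customerSnapshots_0.tar.gz")
```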
l
Yes, but not if I overwrite them in S3 myself…
So I take it I upload the segment to a staging bucket first, and then call the upload API?
k
yes
l
👍 And Pinot will move the segment to the S3 bucket where it should live?
k
yes
l
Ah awesome, it makes sense now
thanks
k
it’s very similar to the regular upload flow; the trick is really in naming the segment
l
that should be fine
table_partition_date would work
k
and ensuring you have the same number of partitions
date should not be there
l
And sorting for the sorted index
hmm
But if I want to keep a rolling historical view by week
I’ll need to add new segments
k
ah ok
then you are right
l
Is that an append table then? But just effectively overwriting previous days?
k
correct
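A small sketch of a naming scheme that behaves this way, using a hypothetical segment_name helper; the table name and week format are illustrative:

```python
from datetime import date

def segment_name(table: str, day: date, partition_id: int) -> str:
    """One segment name per (ISO week, partition): within a week the daily
    recompute re-pushes the same names (overwriting them); a new week starts
    a fresh set of names, so history accumulates like an APPEND table."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{table}_{iso_year}w{iso_week:02d}_{partition_id}"

# Two days in the same week map to the same segment name, so the later push overwrites.
assert segment_name("customerSnapshots", date(2020, 3, 2), 0) == "customerSnapshots_2020w10_0"
assert segment_name("customerSnapshots", date(2020, 3, 4), 0) == "customerSnapshots_2020w10_0"
```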
l
That makes sense
There really isn’t much of a functional difference between an APPEND table and a REFRESH table then, is there?
k
nope
the only other difference is bookkeeping
l
👍 thanks for answering all these questions
Bookkeeping as in?
k
we don’t do any retention
l
Ah right
k
for refresh
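A hedged sketch of what the APPEND-mode counterpart could look like, with the retention fields that make up that bookkeeping; the retention window, push frequency, and replication values are illustrative, not recommendations from this thread:

```python
# Hypothetical APPEND-mode config for the weekly-snapshot variant. The retention
# fields are the bookkeeping that REFRESH tables skip: Pinot's retention manager
# drops segments older than the configured window.
weekly_snapshot_config = {
    "tableName": "customerSnapshotsWeekly",
    "tableType": "OFFLINE",
    "segmentsConfig": {
        "segmentPushType": "APPEND",
        "segmentPushFrequency": "DAILY",   # expected push cadence
        "timeColumnName": "snapshotMillis",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "365",       # keep roughly a year of weekly snapshots
        "replication": "2",
    },
}
```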
l
that makes sense
What’s the significance of the refresh interval then?
for minion tasks?
k
more for alerting
we used it to alert if no refresh happened in that interval
in your case, I would go with APPEND
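A possible staleness check in the spirit of that alerting, querying the broker for the newest snapshot time; the broker address, table, column name, and threshold are placeholders:

```python
import time
import requests

BROKER = "http://pinot-broker:8099"            # placeholder broker address
EXPECTED_REFRESH_MILLIS = 24 * 3600 * 1000     # alert if no refresh within a day

def snapshots_are_stale() -> bool:
    """Return True if the newest snapshot is older than the expected refresh interval."""
    resp = requests.post(
        f"{BROKER}/query/sql",
        json={"sql": "SELECT MAX(snapshotMillis) FROM customerSnapshots"},
        timeout=30,
    )
    resp.raise_for_status()
    latest_millis = resp.json()["resultTable"]["rows"][0][0]
    return (time.time() * 1000) - latest_millis > EXPECTED_REFRESH_MILLIS

if snapshots_are_stale():
    print("ALERT: customerSnapshots has not been refreshed within the expected interval")
```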